Rest in Peas: The Unrecognized Death of Speech Recognition

Pushing up daisies (Photo courtesy of Creative Coffins)

Mispredicted Words, Mispredicted Futures

The accuracy of computer speech recognition flat-lined in 2001, before reaching human levels. The funding plug was pulled, but no funeral, no text-to-speech eulogy followed. Words never meant very much to computers—which made them ten times more error-prone than humans. Humans expected that computer understanding of language would lead to artificially intelligent machines, inevitably and quickly. But the mispredicted words of speech recognition have rewritten that narrative. We just haven’t recognized it yet.

After a long gestation period in academia, speech recognition bore twins in 1982: the suggestively-named Kurzweil Applied Intelligence and sibling rival Dragon Systems. Kurzweil’s software, by age three, could understand all of a thousand words—but only when spoken one painstakingly-articulated word at a time. Two years later, in 1987, the computer’s lexicon reached 20,000 words, entering the realm of human vocabularies, which range from 10,000 to 150,000 words. But recognition accuracy was horrific: 90% wrong in 1993. Another two years, however, and the error rate pushed below 50%. More importantly, Dragon Systems unveiled its NaturallySpeaking software in 1997, which recognized normal human speech. Years of talking to the computer like a speech therapist seemingly paid off.

However, the core language machinery that crushed sounds into words actually dated to the 1950s and ‘60s and had not changed. Progress mainly came from freakishly faster computers and a burgeoning profusion of digital text.

Speech recognizers make educated guesses at what is being said. They play the odds. For example, the phrase “serve as the inspiration” is ten times more likely than “serve as the installation,” which sounds similar. Such statistical models become more precise given more data. Helpfully, the digital word supply leapt from essentially zero to about a million words in the 1980s when a body of literary text called the Brown Corpus became available. Millions turned to billions as the Internet grew in the 1990s. Inevitably, Google published a trillion-word corpus in 2006. Speech recognition accuracy, borne aloft by exponential trends in text and transistors, rose skyward. But it couldn’t reach human heights.
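The odds-playing described above can be sketched in a few lines. The phrases and counts below are invented for illustration; real systems estimate such statistics from large text collections like the Brown Corpus or Google's trillion-word corpus.

```python
# A minimal sketch of a recognizer "playing the odds" between two
# acoustically similar phrases. Counts are toy values, not real corpus data.
from collections import Counter

# Pretend frequencies harvested from a large body of text
phrase_counts = Counter({
    "serve as the inspiration": 1000,
    "serve as the installation": 100,
})
total = sum(phrase_counts.values())

def phrase_probability(phrase):
    """Relative frequency of the phrase in the (toy) corpus."""
    return phrase_counts[phrase] / total

# Given two sound-alike candidates, pick the statistically likelier one
candidates = ["serve as the inspiration", "serve as the installation"]
best = max(candidates, key=phrase_probability)
print(best)  # the ten-times-likelier phrase wins
```

More data sharpens these estimates, which is why the growth of digital text mattered so much to recognition accuracy.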

Source: National Institute of Standards and Technology Benchmark Test History 

“I’m sorry, Dave. I’m afraid I can’t do that.”

In 2001 recognition accuracy topped out at 80%, far short of HAL-like levels of comprehension. Adding data or computing power made no difference. Researchers at Carnegie Mellon University checked again in 2006 and found the situation unchanged. With human discrimination as high as 98%, the unclosed gap left little basis for conversation. But sticking to a few topics, like numbers, helped. Saying “one” into the phone works about as well as pressing a button, approaching 100% accuracy. But loosen the vocabulary constraint and recognition begins to drift, turning to vertigo in the wide-open vastness of linguistic space.

The language universe is large, Google’s trillion words a mere scrawl on its surface. One estimate puts the number of possible sentences at 10^570. Through constant talking and writing, more of the possibilities of language enter into our possession. But plenty of unanticipated combinations remain which force speech recognizers into risky guesses. Even where data are lush, picking what’s most likely can be a mistake because meaning often pools in a key word or two. Recognition systems, by going with the “best” bet, are prone to interpret the meaning-rich terms as more common but similar-sounding words, draining sense from the sentence.

Strings, heavy with meaning. (Photo credit: t_a_i_s)

Statistics veiling ignorance

Many spoken words sound the same. Saying “recognize speech” makes a sound that can be indistinguishable from “wreck a nice beach.” Other laughers include “wreck an eyes peach” and “recondite speech.” But with a little knowledge of word meaning and grammar, it seems like a computer ought to be able to puzzle it out. Ironically, however, much of the progress in speech recognition came from a conscious rejection of the deeper dimensions of language. As an IBM researcher famously put it: “Every time I fire a linguist my system improves.” But pink-slipping all the linguistics PhDs only gets you 80% accuracy, at best.
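The tie-breaking a recognizer does between such sound-alike strings can be sketched with a toy bigram language model. Every word pair and probability below is invented for illustration; production systems learn these from billions of words of text.

```python
# Toy disambiguation of acoustically identical candidates using a
# bigram language model. All probabilities here are made up for the sketch.
import math

bigram_logprob = {
    ("<s>", "recognize"): math.log(0.02),
    ("recognize", "speech"): math.log(0.3),
    ("<s>", "wreck"): math.log(0.001),
    ("wreck", "a"): math.log(0.05),
    ("a", "nice"): math.log(0.01),
    ("nice", "beach"): math.log(0.02),
}
FLOOR = math.log(1e-8)  # back-off penalty for unseen word pairs

def score(words):
    """Sum of bigram log-probabilities for one candidate transcription."""
    pairs = zip(["<s>"] + words, words)
    return sum(bigram_logprob.get(p, FLOOR) for p in pairs)

candidates = [["recognize", "speech"], ["wreck", "a", "nice", "beach"]]
print(max(candidates, key=score))  # the likelier word string wins
```

Note what the model never consults: meaning. It ranks word strings purely by how often their parts co-occur in text, which is exactly the shallow statistical knowledge the article describes.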

In practice, current recognition software employs some knowledge of language beyond just the outer surface of word sounds. But efforts to impart human-grade understanding of word meaning and syntax to computers have also fallen short.

We use grammar all the time, but no effort to completely formalize it in a set of rules has succeeded. If such rules exist, computer programs turned loose on great bodies of text haven’t been able to suss them out either. Progress in automatically parsing sentences into their grammatical components has been surprisingly limited. A 1996 look at the state of the art reported that “Despite over three decades of research effort, no practical domain-independent parser of unrestricted text has been developed.” As with speech recognition, parsing works best inside snug linguistic boxes, like medical terminology, but weakens when you take down the fences holding back the untamed wilds. Today’s parsers “very crudely are about 80% right on average on unrestricted text,” according to Cambridge professor Ted Briscoe, author of the 1996 report. Parsers and speech recognition have penetrated language to similar, considerable depths, but without reaching a fundamental understanding.

Researchers have also tried to endow computers with knowledge of word meanings. Words are defined by other words, to state the seemingly obvious. And definitions, of course, live in a dictionary. In the early 1990s, Microsoft Research developed a system called MindNet which “read” the dictionary and traced out a network from each word to every mention of it in the definitions of other words.

Words have multiple definitions until they are used in a sentence which narrows the possibilities. MindNet deduced the intended definition of a word by combing through the networks of the other words in the sentence, looking for overlap. Consider the sentence, “The driver struck the ball.” To figure out the intended meaning of “driver,” MindNet followed the network to the definition for “golf” which includes the word “ball.” So driver means a kind of golf club. Or does it? Maybe the sentence means a car crashed into a group of people at a party.
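MindNet's overlap-hunting can be sketched in the style of the classic Lesk algorithm. The senses and the word sets standing in for definition networks below are invented for illustration, not MindNet's actual data.

```python
# A Lesk-style sketch of disambiguation by definition overlap.
# Senses and their "network" word sets are invented stand-ins.
senses = {
    "driver": {
        "golf club": {"golf", "club", "hit", "ball"},
        "vehicle operator": {"car", "vehicle", "operate", "road"},
    }
}

def disambiguate(word, context_words):
    """Pick the sense whose definition network overlaps the context most."""
    context = set(context_words)
    return max(senses[word], key=lambda s: len(senses[word][s] & context))

print(disambiguate("driver", ["struck", "the", "ball"]))  # -> golf club
```

The sketch also shows why the approach is brittle: if the sentence really describes a car crashing into a party, nothing in the overlapping words can reveal it.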

To guess meanings more accurately, MindNet expanded the data on which it based its statistics, much as speech recognizers did. The program ingested encyclopedias and other online texts, carefully assigning probabilistic weights based on what it learned. But that wasn’t enough. MindNet’s goal of “resolving semantic ambiguities in text” remains unattained. The project, the first undertaken by Microsoft Research after it was founded in 1991, was shelved in 2005.

Can’t get there from here

We have learned that speech is not just sounds. The acoustic signal doesn’t carry enough information for reliable interpretation, even when boosted by statistical analysis of terabytes of example phrases. As the leading lights of speech recognition acknowledged last May, “it is not possible to predict and collect separate data for any and all types of speech…” The approach of the last two decades has hit a dead end. Similarly, the meaning of a word is not fully captured just by pointing to other words as in MindNet’s approach. Grammar likewise escapes crisp formalization.  

To some, these developments are no surprise. In 1986, Terry Winograd and Fernando Flores audaciously concluded that “computers cannot understand language.” In their book, Understanding Computers and Cognition, the authors argued from biology and philosophy rather than producing a proof like Einstein’s demonstration that nothing can travel faster than light. So not everyone agreed. Bill Gates described it as “a complete horseshit book” shortly after it appeared, but acknowledged that “it has to be read,” a wise amendment given the balance of evidence from the last quarter century.

Fortunately, the question of whether computers are subject to fundamental limits doesn’t need to be answered. Progress in conversational speech recognition accuracy has clearly halted and we have abandoned further frontal assaults. The research arm of the Pentagon, DARPA, declared victory and withdrew. Many decades ago, DARPA funded the basic research behind both the Internet and today’s mouse-and-menus computer interface. More recently, the agency financed investigations into conversational speech recognition but shifted priorities and money after accuracy plateaued. Microsoft Research persisted longer in its pursuit of a seeing, talking computer. But that vision became increasingly spectral, and today none of the Speech Technology group’s projects aspire to push speech recognition to human levels.

Cognitive dissonance

We are surrounded by unceasing, rapid technological advance, especially in information technology. It is impossible for something to be unattainable. There has to be another way. Right? Yes—but it’s more difficult than the approach that didn’t work. In place of simple speech recognition, researchers last year proposed “cognition-derived recognition” in a paper authored by leading academics, a scientist from Microsoft Research and a co-founder of Dragon Systems. The project entails research to “understand and emulate relevant human capabilities” as well as understanding how the brain processes language. The researchers, with that particularly human talent for euphemism, are actually saying that we need artificial intelligence if computers are going to understand us.

Originally, however, speech recognition was going to lead to artificial intelligence. Computing pioneer Alan Turing suggested in 1950 that we “provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English.” Over half a century later, artificial intelligence has become prerequisite to understanding speech. We have neither the chicken nor the egg.

Speech recognition pioneer Ray Kurzweil piloted computing a long way down the path toward artificial intelligence. His software programs first recognized printed characters, then images and finally spoken words. Quite reasonably, Kurzweil looked at the trajectory he had helped carve and prophesied that machines would inevitably become intelligent and then spiritual. However, because we are no longer banging away at speech recognition, this new great chain of being has a missing link.

That void and its potential implications have gone unremarked, the greatest recognition error of all. Perhaps no one much noticed when the National Institute of Standards and Technology simply stopped benchmarking the accuracy of conversational speech recognition. And no one, speech researchers included, broadcasts their own bad news. So conventional belief remains that speech recognition and even artificial intelligence will arrive someday, somehow. Similar beliefs cling to manned space travel. Wisely, when President Obama cancelled the Ares program, he made provisions for research into “game-changing new technology,” as an advisor put it. Rather than challenge a cherished belief, perhaps the President knew to scale it back until it fades away.

Source: Google

Speech recognition seems to be following a similar pattern, signal blending into background noise. News mentions of Dragon Systems’ NaturallySpeaking software peaked at the same time as recognition accuracy, 1999, and declined thereafter. “Speech recognition” shows a broadly similar pattern, with peak mentions coming in 2002, the last year in which NIST benchmarked conversational speech recognition.

With the flattening of recognition accuracy comes the flattening of a great story arc of our age: the imminent arrival of artificial intelligence. Mispredicted words have cascaded into mispredictions of the future. Protean language leaves the future unauthored.




96 responses
I don't think speech recognition can work so long as it's just flat out NERDY-LOOKING to be seen talking to a computer in public. The technology will eventually catch up, but will probably just be used in a Star Trek like fashion of turning on the lights or heating up tea or something like that. Just my two cents because we can never be quite sure where the tides of fortune may take us.

Also, the idea of privacy when speaking to a computer is lost. It's like voluntarily writing your phone number where anyone can see it; what kind of moron would do that? Privacy-wise, speech recognition doesn't work.

One thing that's interesting is that text-to-speech is getting better at a pretty fast rate. The newer Alex voice in Leopard/Snow Leopard takes breaths before sentences and sounds fairly realistic, in my humble opinion.

Interesting article. Thanks.

The speech recognition on my Droid phone is very effective for Google searches and navigation. For those purposes the current state of the art seems adequate.
Whew! Had to sign up and jump thru hoops just to say ...

All well and good -- though there's no mention of economic factors. The state of consumer-level speech recognition (SR) HAD been modestly healthy ... until Lernout & Hauspie (the Enron of the SR world) bought out all its competition, ruining all research momentum before flaming out. This was years ago, but the industry has yet to recover!

Hey there! There are a lot of points I'd like to address in your write up. Let me start with a disclaimer: I work for Nuance Communications, the leading global vendor of speech recognition technology. I also worked at Dragon Systems when NaturallySpeaking was created, and product managed versions 4 and 5 of the software. I studied speech recognition briefly under Victor Zue and Jim Glass at MIT before I made it a career.

First of all, any discussion of speech recognition is useless without defining the task--with the references to Dragon I'll assume we're talking about large-vocabulary, speaker-dependent, general-purpose continuous automatic speech recognition (ASR) using a close-talking microphone. Remember that "speech recognition" is successfully used for other tasks, from hands-free automotive controls to cell phone dialing to over-the-phone customer service systems. For this defined task, accuracy goes well beyond the 20% word error rate cited here. Accuracy even bests that for speaker-independent tasks in noisy environments without proper microphones, but of course those have constricted vocabularies, making them easier tasks. In some cases, you write about the failure to recognize "conversational speech," which is a different task involving multiple speakers who are not aware of an ASR system trying to transcribe their words. Software products such as Dragon do not purport to accomplish this task; for that, you need other technologies that are still tackling it.

Re: "The core language machinery had not changed since the 50s and 60s": Actually, it was the Bakers' reliance on Hidden Markov Models (HMM) that made NaturallySpeaking possible. Where other ASR attempts focused on either understanding words semantically (what does this word mean?) or on word bigram and trigram patterns (which words are most likely to come next?), both techniques you described, the HMM approach at the phoneme level was far more successful. HMM's are pretty nifty; it's like trying to guess what's happening in a baseball game by listening to the cheers of the crowd from outside the stadium. Other attempts have proven not to be as successful, such as your citing of MindNet's attempts with Microsoft, and have rightfully been abandoned.

Re: "accuracy of speech recognition has flatlined since 2001" -- the only data I see supporting this claim is the NIST data you cited through 2002. The NIST tests rely on only a few ASR engines, sometimes in suboptimal settings like for "broadcast data" for people not aware they're talking to a speech system. The Dragon engine is never mentioned in the tests. And the data showed continual improvement for most tasks over the years. I don't know why NIST stopped benchmarking, but Dragon hasn't stopped selling -- v10 was released last year with remarkable reviews by tech writers such as David Pogue in the New York Times. I've seen Nuance's lab data showing WERR levels steadily decreasing with each new release of the Dragon product, as well as for our other recognition products such as the over-the-phone recognition. It's to the point where recognition is flat lining--at the 99% level for certain speakers. The challenge these days is to get a wider variety of speakers to achieve the same level of success.

Re: the 2006 study showing 80% accuracy. That was an interesting study, though the ASR vendors aren't mentioned. However, I did notice that "one must acknowledge that minimal user training and variations in input style (read and spontaneous) contributed to this." The 80% was an average over several tasks, including tasks that the software was not designed to handle (one task had 3% accuracy!). This is like saying that the accuracy of a hammer is only 33%, since it was able to pound a nail but failed miserably at fastening screws and stapling papers together.

Re: Ted Briscoe's 1996 study on the state of the art citing that speech recognition hasn't improved despite decades of research. NaturallySpeaking v1 debuted on April 2, 1997. Similarly, you equate the drop in news articles about NaturallySpeaking with a flatlining of accuracy rate. I don't think you can logically relate one to the other.

To fans of speech recognition, there's hope. Academic study of ASR accuracy can only get you so far. What matters is whether real live people can use it. Saying that no one uses speech is like saying no one goes to that restaurant because it's too crowded. Nuance Communications automates over 9 billion phone calls a year based on speech recognition technology. Its transcription services are used in hospitals throughout the U.S. Advancements are made with every new product release, often with a focus more on usability than on accuracy. It's often more important to train users how to use speech than it is to train speech systems how to recognize users. It's very easy to have a negative first experience with ASR -- trying to pound the screw with a hammer -- and to write off the whole technology. And yes, it does make mistakes. But overall it's a net gain, especially compared to the average typing speed of most users.

I hope you found my comments helpful. I'm happy to speak to you and others about what I know of the history of speech recognition.

State of the art wide coverage parsers are currently sitting around 88-95% accuracy, not 80%, with >99% coverage (meaning a successful, though possibly incorrect, parse of 99% of unlabelled unrestricted text). The NLP parsing field is very active; I don't follow any developments in speech recognition, but don't be fooled into thinking that NLP is stagnant!
The bit about parser accuracy is missing an important point. If you ask human linguists to parse a sentence, you only get consensus on 95% of the parses. So if humans can only do 95%, why would we expect machines to do better?
Very interesting post! I don't really believe it's dead, as it could make a real comeback with things like Project Natal, the Wii, etc., and with further advances in human-machine interface technologies, since speech will have to be there alongside touch and sight.
I had some experience with speech recognition models as a post-doc in a language research department in the mid 90s. I agree with Jeff F. above that the phoneme-based models are much more promising. What struck me as strangest at the time was that the researchers doing the work felt that the acoustic properties of speech (fundamental frequency, etc.) could be ignored. At that point I was baffled with what exactly they were trying to recognize, if not an audible signal with measurable properties. How they were defining the object of their highly developed, bells and whistles neural networks escaped me.

They seemed to think they could go straight to applying what works in reading (individual characters largely ignored, whole words recognized). Infants learning speech progress from reacting to individual sounds (phoneme discrimination as early as 3 months) and prosody (raising your voice at the end of a question). Babyspeak (or "motherese") emphasizes these traits. Leaving me with the question (for these past 15 years) of why modelers don't train up their systems in a similar fashion.

But then again, I'm a Ph.D. linguist....
Thanks for the post and for the comments.

I've been in software development for 14 years, many of those at contract development agencies building custom applications for clients. There are a handful of phrases that make me stop taking a client seriously, and "speech recognition" is at the top of that list (unless the phrase is followed by "is bullsh*t").

Entrepreneurs who want to use speech recognition in a product should abandon the idea of doing it in software; if for some reason you MUST have speech-to-text functionality, it's cheaper, easier, and more reliable to use a transcription service.

Use guided speech recognition IVRs combined with humans handling the errors = better results...
"Entrepreneurs who want to use speech recognition in a product should abandon the idea of doing it in software; if for some reason you MUST have speech-to-text functionality, it's cheaper, easier, and more reliable to use a transcription service."

AMEN. This country trusts thousands of its people's lives to speech recognition every single day; the lucky ones have that text at least edited by trained medical transcriptionists, who are specialists in medical language, nuance, and deciphering and correcting the dictation errors of exhausted physicians and surgeons. The unlucky end up with a medical record that tells nothing of their personal story, does not connect the dots between their symptoms, does not allow the physician to speak in his or her natural "voice," and is full of gibberish that could actually be putting the patient's life at risk because it can't tell one drug name from another or that there is no way in the world the patient with kidney problems should ever be dosed with an NSAID.

And why is this happening? Overhead, plain and simple. Transcriptionists now require more education and more experience to be paid far less than they were 10 or even 5 years ago. These are people who are working on production - a single-digit number of cents per line - and most are doing it because they love the work, the learning, and their ability to help in some way in providing patient care. And that is one of the biggest reasons that speech recognition in healthcare is a giant detriment to our society: Computers can be trained to predict, they can compile a tremendous amount of data in order to produce better statistics by which to judge the dictator, but they can never be taught to care about the patient. They can never have the drive and ambition and COMPASSION to actually learn more about a particular field or disease in order to be of better service to their patients.

Give me the person behind the screen who just might have some empathy for my spouse or my child who is on the operating table. Dump the big metal box.

I may not have the same long-term experience as others in these comments but I do have some. I have worked in a disability resource lab for the last 4 years and work regularly with both Kurzweil 3000 (a text reader) and Dragon Naturally Speaking as tools for disabled students. To say that speech recognition is dead is a gross misstatement.

In the short time I have been involved with these programs, particularly with Dragon, I have seen some vast improvements, including the addition of a regional dialect feature for people with heavy accents. Dragon Naturally Speaking touts up to a 98% accuracy rating, which I don't particularly believe to be true. However, I have easily seen around a 95% accuracy rating using their software. Starting with just the 8-minute initial setup, I would say most people's accuracy rating lies around 90%. The software is very intuitive and adaptive to your needs. The fact that I have been able to do everything from navigating the operating system itself to dictating and editing written paragraphs entirely by voice allows me to believe that while public perception of speech recognition has fallen, the progress companies like Dragon and Kurzweil have made should still be considered an amazing achievement.

Speech recognition is not AI. Nor will it ever develop by itself into some form of artificial intelligence. Sure, SR may be used to assist with the implementation and development of AI, but those days are still far ahead of us.

I'd written off speech recognition myself--at least for use in query interfaces--until getting an Android phone. I find that the recognition is accurate enough that I prefer it over typing and often use it as a first resort for search queries. Yes, it helps that Google knows something about query distribution--so perhaps this isn't as hard a problem as what NIST or DARPA were testing. But there's no question that speech recognition gives me practical value--and that's even with the latency of server-side processing!

Full-disclosure: I work for Google and may be favorably predisposed toward my employer's products--but I haven't had any role in Google's speech recognition efforts.

It's a long blog post, and it sums up the ASR problems and their ultimate(!!) fate. However, it must also be taken into account that our understanding of the speech signal itself is not complete, even with such development of analysis technologies and tools. One important area where ASR can be very useful will be robot communication, or futuristic communication. It's the age of robots, where you would like to talk to your robot and have it respond to you. For this, at least, ASR is needed. You will not always want a human-sized robot to be controlled by a small remote (!!). That is just one application. Many unforeseen applications will open up once the technology is faster (at this time it is not) and reliable (not yet there still). Hence it may be some 10 years down the line. Till that time, industry guys, just wait and make do with those mediocre softwares which put you to speech :).
I was inspired by the vagaries of voice recognition software to write a poem in the early nineties. After I composed the first piece, I read it back to the computer and the second version is what it thought it heard!

In 2006, I tested Dragon Naturally Speaking 8, and with just five minutes training, it only got two words incorrect.
Read the poem, surreal voice recognition, at, and see what you think.

Perhaps we should think in terms of "voice control" instead of all-out recognition? For many applications, only a simple interface is needed - I think the hunt for the perfect has wrecked the implementation of the good.
I don't find automated speech recognition rates any lower than that of most teenage boys - I mean how often are they looking at me in utter incomprehension after I elucidate an important point?
Those of us in the Medical Transcription field heard all kinds of predictions, to wit: all doctors would soon be using VR and would no longer need human scribes. In the last few years, the tide has most definitely been reversed, and clients are abandoning their costly Dragon systems when they prove too difficult to manage. One MD I know spent over four hours every night cleaning up the mess his Dragon system had created... interesting, given that it was purchased to SAVE him time. He now uses a human service.

So far there is no substitute for the human brain when it comes to content-heavy dictation. It's one thing to use VR for simple command-based usage; quite another for things as important and specific (as well as context-dependent) as medical records. Never underestimate the human factor in keeping communication clear. For the Dragon system to work well for those in healthcare, the user must have organized speech patterns with limited use of colloquial language--and I've dealt with very few practitioners whose verbal skills are up to the challenge.

My goodness... I'm truly flabbergasted... pardon me for saying so but this is utter nonsense.

I've been using Dragon NaturallySpeaking since its early releases, and it has been steadily increasing in accuracy. I used to be a senior engineer at Intel, demonstrating technology on stage with chairman Andy Grove at events all over the world. We used a lot of speech dictation demos and they were so accurate that people thought we were faking them.

If you read the directions carefully, and if you use a USB microphone, the recognition is uncanny. In fact, I'm using Dragon NaturallySpeaking Preferred Edition version 10.0 to dictate this.

In addition to extremely accurate voice dictation, there are those really cool commands, like being able to say "search Google for Balloon Boy" and having it automatically open up your browser and enter the search term -- something like this is accomplished many times faster than a human could do it. Or being able to total up a column of numbers in Microsoft Excel by saying simply "total this column" and seeing the results in the blink of an eye, literally.

I've used Dragon NaturallySpeaking successfully on airplanes, despite the constant high-pitched whine. I also have a Viliv pocket computer which I can whip out and dictate on anytime I'm waiting for somebody or at a bus stop or some other idle time -- it even works when I'm in a traffic jam.

I can also dictate into my phone, pull out the SD card, insert it into my laptop and have Dragon transcribe what I said. I've even tried this walking down the street in Manhattan, and you won't find a noisier place anywhere in the world.

Anyone who doesn't believe voice dictation is accurate is welcome to come hang out with me for a day in New York and I will totally change your mind.

Dan Nainan
Professional Comedian/Actor

I tried Dragon back in the early 2000's when I was suffering terrible tendinitis in my wrists, clearly a problem for a software engineer. I soon found that it required me to speak in a monotone (which began to carry over into my interpersonal speech) and that I ended my day with a sore throat *and* wrists.

I can't see any use for computer speech recognition other than transcription, which really isn't of use for most of us. The only mainstream consumer-facing applications I can see are things like in-car navigation systems and automated phone systems, both of which have quite restricted contexts.

Of course, there are outlying uses like the disabled, but as usual those aren't cash crop markets and, thusly, attract little attention.

Jeff, I forgot to address that in my original post.

Every time I tell somebody about Dragon NaturallySpeaking, the response is always the same -- "I tried that a few years ago, and it doesn't work". That's analogous to somebody saying "I tried the Motorola brick phone in the 1980s and it's just too heavy" and giving up on all cell phones forever. Technology does make things better over time, and Dragon is no exception.

What drew me to Dragon is that at Intel, I was also suffering from wrist pain. I can't imagine having to type everything that I type by hand... I would truly be unable to do my work. And I speak into my computer at a normal conversational tone, not a monotone. For me, the software is 98% to 99% accurate...even if it were only 95% accurate, that means you would only have to type 5% of the time, which is a lot less typing, especially if you have wrist pain.

I also forgot to mention that in addition to my Logitech USB headset, I also got a Logitech *wireless* USB headset, which is fantastic... I can walk around the room, pacing, while I write my book... I can also answer calls on Skype using the same headset, it's very convenient.

I think someone has summed up this rubbish rather well on the no brainer speech forum; see the link below. I've already been using it for four months, I'm able-bodied, and it is a godsend. You guys need to check it out and learn how to dictate.

Mr Fortner seems to equate Speech Recognition with Dictation.
Dictation is only one of many SR applications, and certainly not the most common. For example, almost every cell phone has "Call <name>" and many cars have "Play <name>". Call Center technology is still doing well. On the horizon, companies are using Speech Recognition to judge pronunciation of people trying to learn languages. These other markets are doing just fine.

Yes, Dictation is a mature market and technology. Yes, Dictation has done most of its growing. But then again, how many technologies from the '80s are still around and still selling anything?

So, I think his article should be titled "Death of Dictation," and I agree... But Dragon went out of business in 2001; who cares that interest in "Dragon Naturally Speaking" is declining?

I'm a certified CART provider in New York City. That means I transcribe college classes and other events in realtime for people who are Deaf or hard of hearing. My steno machine lets me write whatever's spoken at up to 240 words per minute, and the words appear instantly in English on my computer screen, a fraction of a second after they're spoken.

A lot of people still ask me, "What are you going to do when your job becomes obsolete in the next ten years?" This article pretty much sums up my position on the matter, which is to say I'm not too worried about it. Some of the commenters quibble with these numbers, but the essential point is pretty much indisputable: Until machines actually understand language the way humans do, humans are always going to be better at transcribing it. In practical terms, that means that while machine speech recognition can be useful if a human is directly dictating to the software and is willing to stop dictating, edit, and then continue whenever it makes mistakes, machine speech recognition is not at all useful in the situations I work in, where the speaker is untrained, unwilling to stop and edit, and usually speaking to a larger audience in uncontrolled conditions, and where the recipient of the transcript does not have the ability to distinguish between correct translations and errors.

There has been significant improvement in speech recognition technology over the last 20 years, and it will continue to improve, but the improvement is asymptotic, as the chart makes very clear; current corpus-based probability algorithms will result in increased accuracy for typical phrases, but that improvement, by definition, will inevitably drive down the accuracy of atypical phrases. Since the devil is in the details, and people aren't usually content with "Well, that's what most people say, most of the time!" when they get a prediction-based mistranslation, I'm counting on getting work for a good long time. Check back in ten years and we'll see if I'm right.

(PS: Another advantage of stenographic input over text, as many people have mentioned already, is that it's silent. Commercial steno software is currently far too expensive for non-professional use, but I'm trying to change that by developing an open source version that works with a $60 qwerty keyboard. Check out the Plover Blog for more details.)

Mirabai, you are making exactly the argument which most people now accept: that emulated human artificial intelligence is dead. This is completely different from speech recognition and what it purports to be. Don't get the two confused like the blogger. This article does nothing but try to imply that speech recognition is a dead duck because artificial intelligence is. If some proper research were done, he would see that speech recognition is gaining in popularity and accuracy on a logarithmic scale; it does not purport to be cognitive in any way.
Sheila
Grrr... There seems to be no way to edit on this site

In my comment I, of course, meant "Mr Fortner seems to equate Speech Recognition with Desktop Dictation".

There are lots of successful uses of Large Vocabulary Speech Recognition off the Desktop, as Vlingo and Google have demonstrated.

Mobile Search is a huge market and growing rapidly. In a world where computers fit in your hand, Speech Recognition is hardly dead!

The article appears to neglect important speaker-dependent factors like dialect and diction. It quotes human accuracy rates of >98%, but that is certainly not the case for humans listening to an unfamiliar dialect or a foreigner speaking with a thick accent in a non-native tongue. I would bet that in those cases, human recognition borders on the 80% level. So, if computers are not programmed to recognize and account for such differences, how can they be expected to do better?

In fact, I would wager that humans only achieve such high accuracy rates because they are adept at recognizing the patterns inherent in various accents and dialects and using that knowledge to screen out the "bad" alternatives that are not consistent with the particular dialect of a particular speaker. So, while "wreck a nice beach" could be confused with "recognize speech" when Sally says it, we would have no problem discerning that Joe really was talking about the beach when he said it.

Further, humans are constantly "learning" (or training) to understand the accent of a particular speaker. Computers would have to do the same and I suspect when this is done for each particular speaker, error rates eventually go way down.

I think speech transcription technology has improved a lot. It depends on the language, the signal, and the articulation. There are several systems where you get 95-98% accuracy by dictating. One Czech company produces an STT system for natural speech (i.e., not dictation). They get about 88-90% accuracy for natural speech over the telephone (i.e., bad microphone, poor articulation, mumbling...), depending on the OOVs, etc.

Speech recognition (SR) is the more general term; speech-to-text (STT) and text-to-speech (TTS) are only its best-known parts.

SR also covers keyword spotting/search in audio (i.e., spoken term detection without transcription), speaker identification/verification/search, language identification, emotion detection, etc. See free demos on . This opens new possibilities for call centres, multimedia archives, etc.

We should stop talking about speech recognition. In fact, what you are talking about is "talking recognition." Let's take the sentence you mentioned, "The driver struck the ball": even humans are not able to understand the meaning without knowing the context. So if we hear this sentence and are expected to react to it, we simply must ask. And that leads us to conversation. Human beings need to store meaning, and that's what computers also need to do. Moreover, human beings have a more realistic storage system than computers: we store memories as "feelings." We have to make a computer able to "feel" before we can make it understand words.
I used Dragon Naturally Speaking 10 to 'speak' 97% of my thesis. I think the technology has come a LONG way since its conception! I talk to my phone and it does what I say (Windows Mobile 6.5) with at least 99% accuracy. I have more problems talking to people (and I talk to people every day in my job) than I do my computer. The ONLY problem I see is that people don't understand that talking to computers is OK; it's not some freakish taboo left only to the geeks and nerds. Until it becomes more socially acceptable and mainstream, it won't catch on, just like almost everything else. Currently, if you look at where speech recognition systems are placed (phones, cars, PDAs, mp3 players), I think we will see them become more commonplace. It's just going to take some time. Speech recognition is not dead, it's just dormant. When it becomes mainstream it will be full speed ahead... Engage... (Star Trek reference for you non-nerds!)
A huge mistake dominates the way the problem is approached: this is NOT a computer, or informatics, problem, BUT a linguistic problem. It will be solved only when linguists are able to write the programs that use their knowledge (databases and theory).
What I'd do: give a free hand to some hundreds of linguists to hire the teams they feel necessary to accomplish this work.
PS: I hold no academic degree in linguistics.
What is needed is a good voice data-entry system for a database. Dragon is not quite there, unless 'trained'. Now I think that a 'thought-writer' will be functioning before a good 'talk-writer'.
I think the problem is if you think that machines may understand anything anytime.
In applications that are not dictating or google search it's probably only a limited set of things you can say, that has no useful meaning. In my doctoral thesis I finished with a site where I cut out this humorous contribution to the debate:
As you see in my previous post translation is not a solved problem. I translated my post from Swedish with Google translate.
My translation would have been:
I think the problem is if you think that machines can understand anything at any time. In applications that are not dictation or Google search, it is probably only a limited set of things you can say that makes sense. In my doctoral thesis I finished with a citation to this humorous contribution to the debate:
I am a physician and a stockholder in Nuance. I became a stockholder because of my experience with Dragon 10. It is enormously useful to me and very fast. Perfect? No, but better than the transcription services I have had.
Fascinating information, Robert; I'm still digesting it!

I actually don't disagree with your basic premise, that speech recognition technology hasn't made many advances of late and has in fact, plateaued. I'm not an engineer, myself, but I hear this sentiment echoed a lot among the speech rec engineers I'm in contact with.

That's the issue: speech rec hasn't yet caught up with real, living, breathing, evolving human speech. The accuracy just isn't there, unless you're willing to talk like a machine. (I actually tested Dragon and liked it quite a bit, but you do have to adapt to its parameters or "train" it to understand your speech).

Mind if I do a little shameless self-promotion here? That's why Spoken Communications has been so successful: we let the ASR engine do what it does, understanding the limitations, and then have human beings intervene when the recognition confidence score is low. That's how we've been able to get much better accuracy: we acknowledge the limitations of speech recognition, adapt, and find a way to use human intelligence in real time.

I'd be very curious to hear your reaction to this solution.

I'd agree with Matthew Wahlquist Stoker: it always takes me a few seconds to 'tune in' to an accent.
Fortunately with technologies like skype and other HD VoIP, we are able to get better quality audio and better user identification which should be able to allow ASR to load a suitable speech pattern for that user based on their previous interactions.
Having tried to use Dragon, as someone who makes their living as a writer, I think I can point to one of the major problems.

In order to actually write for a living you have to be using the language in new and interesting ways (not that I do all the time, you understand, but that is at least the aim).

If the software is pushing you to the default of what everyone else has already used then it's not going to work for those who wish to develop matters, is it?

As an analogy, it's a bit like following MS Word's rules on grammar. You're continually pushed into the passive voice, cannot use the demotic, and so on.

No, I don't claim to have a great (or even decent) style but if software is based upon what everyone else already does then you're not going to end up with a new and interesting one, are you?


The Dragon software simply types what you're saying - it's really no different from typing with your hands, just a lot easier. The software does not push you into any style of writing whatsoever. You talk, it types what you're saying.

The anecdotes I've heard on this sad story all relate to ScanSoft & Nuance. After Nuance bought ScanSoft, and with it Dragon Dictate and a veritable horde of acquisitions and IP already owned by ScanSoft, the need to innovate disappeared. There is no other commercial offering on the market, so why bother? They'd bought their only competitor. Sit on your fat pile of IP and keep milking it; anything else would be irresponsible to your stakeholders. Nuance has this game locked up, and everyone in the world who would benefit from radically enhanced voice communication is screwed; there's no means for innovation, because Nuance, owning the vast bulk of voice recognition patents, can legally prevent disruption. In the meantime, they have no need to innovate; Dragon, as the only product on the market, sells itself.

The evidence is manifest. Up until 2005, ScanSoft had been releasing new major versions of Dragon on a nearly yearly basis. The year Nuance acquired them, releases dried up. After 9 in July 2006, the only other major release has been 10 in 2008, and frankly I saw zero improvement. I can't say for sure; I haven't dug deeply into NUAN operating expenses (until recently steady around 30m R&D on 250m revenue quarterly), and I don't even know where I'd go to dig up old ScanSoft expenses for comparison, but it is obvious the tangible rate of change went drastically downhill the day the only two major players in consumer voice recognition joined up. As this article testifies, it's a stagnant product and a stagnant company, and the most obvious explanation is that the merger accomplished exactly what it was designed to do: entrench Nuance as the one and only player in the speech recognition game, protected from any upstarts by virtue of owning nearly all the patents in the field, permitting them to drop their need to innovate to zero.

The real story here isn't that the problem space is too big or too hairy to make new improvements, or that we've reached our natural limits; the real story is that capitalism and the patent system have systematically screwed over people with hearing disabilities and businesses with a need for speech recognition.

Yet in this entire page, there are only two references to Nuance, both in the comments, and this will be the first reference to ScanSoft.

As an active user of Dragon in a medical practice and EMR, I find it a phenomenal tool. It is no better (or worse) than a medical transcriptionist, and one needs to edit the output as it is created. It is NOT a tool for creative or complex thought production, but it is excellent for generating text in a medical chart. It needs to be spoken to differently than conversational speech, but when correctly used, I get 99+% accuracy. It beats typing or talking to a transcriptionist and waiting 2 days to review the text.
"The software does not push you into any style of writing whatsoever. You talk, it types what you're saying."

I know that's what it's supposed to do. However, when you try to use rhetorical tricks like repetition and so on, or offer a twist on a well-known phrase, then it goes a little awry. For of course it's looking for well-known phrases, so it thinks that you're using that well-known phrase... rather than the artfully constructed change to it.

@rektide: Thanks for doing a much more thorough job than I could of recapping this issue's true cause.

It has been amusing, but also annoying, to read comments quibbling over accuracy rates or linguistics technicalities while everyone ignores the obvious.

@everyone else: yes, of course you use Dragon -- after all, it's the only game remaining, which is the entire point. This is really all about corporate consolidation squashing innovation.

Actually, Dragon _isn't_ the only game in town. There is some interesting stuff going on at Ditech, converting voicemails to text. Google Voice is doing the same thing, only worse.

However, Nuance *is* the only game in town for Desktop Dictation. They are also *very* litigious, which is why innovation is only happening in other forms of speech technology.

But Nuance is not completely to blame. They are actually ScanSoft, which purchased the old Nuance and changed their name. ScanSoft is a company that purchases troubled technology, reduces costs, and turns it into a profitable business. It is no surprise that they don't do much R&D.

The real villain is Lernout and Hauspie, which, like Enron, was a giant pyramid scheme. Through fraud they were able to pump up their value, and they acquired most of the leading Speech IP, including Dragon and Dictaphone in 2000. They then went bust in 2001, taking the entire Dictation market with them.

ScanSoft was able to buy Naturally Speaking at a fantastic price. They also got IBM ViaVoice and Nuance cheaply. Since there was no competition, there was no need to do much innovation. They reduced costs, changed their name, and restarted the business.

I have been VP of Dragon R&D at Nuance for the last 10 years. Many posted comments disagree with the basic premises of this article, for a reason. The article covers some interesting historical data but ignores key progress and evolution in speech research and development over the last 10 years; it is ill informed and seems to embrace controversy. The worst part is that its conclusions are blatantly incorrect.
His fundamental point was that, while speech recognition programs continue to improve, their theoretical limit of improvement remains far below the accuracy rate of a trained human transcriber in uncontrolled conditions. This will continue to be the case unless speech recognition programs develop the ability to synthesize semantic information from what they're transcribing and acquire the ability to fill in the gaps of what is almost inevitably a significantly lossy signal -- that is, untrained, non-dictating natural human speech. Without the ability to infer meaning from unclear or garbled speech by employing an understanding of the context in which it was spoken, software is always going to make more errors than human transcribers. Without the ability to identify and correct those errors, the transcript has the potential to be seriously inaccurate, which can be either catastrophic or irrelevant, depending on why the transcript is needed. Only humans can guess at the missing pieces, and only humans can identify and correct the errors. Until true artificial intelligence equal to human intelligence comes on the scene, human transcription will be superior to machine transcription.

I talk more about this issue, specifically addressing my own specialty, CART (realtime transcription for the Deaf and hard of hearing), in this article:

You know, listening to some of the people on this thread reminds me of people who used to think that the earth was flat, or people who said that computers would never be able to defeat grandmasters at chess, and so on.

The people who vilify Dragon NaturallySpeaking without having tried it remind me of those protesters who were upset about "The Last Temptation of Christ" without even having seen the movie.

Every single time I show somebody a five-minute demo of Dragon NaturallySpeaking, their jaws literally drop. That means that there is a significant gap between the perception of what this program can do and what it actually does.

The training, which so many people seem to think is cumbersome, takes a maximum of seven minutes (I've timed it). All you do is read some text into your computer. How difficult is that?

And the capabilities of the program are astounding, and yes, 10.0 is a significant leap over what 9.0 could do.

Before you knock the program, try it... you will see what I mean.

And as far as the transcription and steno and all that, trust me, five or 10 years down the road, computers will be able to do all of that and more. That's the nature of technology... it moves forward.

If you had read my article, you would have seen that I was absolutely not denying that voice recognition software can be very accurate in ideal conditions when a human is in charge of producing the transcript. That means either that they're dictating the content of what the transcript is meant to contain or they're mirroring the words of a third party in such a way as to give the program clear, unambiguous audio input. Under these conditions, the software performs quite well, and any errors can be corrected by the operator of the software as they occur. My point was that you can't jump from "given a clear audio signal, the software works well" to "human transcribers will eventually become obsolete", because it ignores the many situations in which audio quality is not ideal, speakers or content producers are not willing or able to match their speech to the program's specifications, or the intended recipient of the transcript is not able to correct the transcript's errors.

This accounts for nearly all realtime audio transcription, and a great deal of offline transcription as well. The only way to produce an accurate transcript from suboptimal audio is to fill in the gaps using context, which requires not only an intuitive understanding of the mechanisms of human language, but also specific knowledge about the content of the audio, which can help distinguish the phrase "that they actually have the patients" in a sentence about the protocol of a clinical trial, versus "that they actually have the patience" in a sentence about measuring impulse control.
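The "patients"/"patience" distinction above can be made concrete with a toy sketch: score each homophone by how strongly it co-occurs with the other words in the utterance. Everything here (the context words, the counts, the `disambiguate` helper) is invented purely for illustration; real recognizers use far larger n-gram or semantic models, not hand-written tables:

```python
# Hypothetical co-occurrence counts (all numbers invented): how often
# each candidate spelling appears near various context words in some
# imagined training text.
cooccurrence = {
    "patients": {"clinical": 40, "trial": 35, "protocol": 20, "impulse": 1},
    "patience": {"clinical": 2, "trial": 3, "protocol": 1, "impulse": 30},
}

def disambiguate(candidates, context_words):
    """Pick the candidate that co-occurs most strongly with the context."""
    def score(word):
        counts = cooccurrence[word]
        # Context words absent from the table contribute zero.
        return sum(counts.get(c, 0) for c in context_words)
    return max(candidates, key=score)

# In a sentence about a clinical trial, "patients" wins;
# in a sentence about impulse control, "patience" wins.
print(disambiguate(["patients", "patience"], ["clinical", "trial"]))
print(disambiguate(["patients", "patience"], ["impulse", "control"]))
```

The sketch also shows the limitation being argued: the choice is only as good as the statistics, and an atypical sentence (a trial of patience, say) would be silently mistranscribed.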

Fully unrestricted speech is hard to achieve; our focus should turn to (multimodal) interaction!
Mperak, I totally agree - we have been experimenting with transcription in Google Wave (see the idea being that we can know the speaker's identity and the context of the call from the rest of the Wave.

It doesn't work very well (yet) - especially on my british accent, but it might in the future.

It seems that most SR efforts to date have presumed software running on computers. Because computers are designed to do arithmetic and simple list processing, no amount of software can make them cope with large, dense semiotic webs at viable speed and cost. Who is interested in applying a new kind of hardware chip that resolves combinatorial nets at Gbit/sec speed and exhibits constant run time regardless of the size of the web or the frequency of coincidence?
Very interesting article (and comments/answers, of course) with many references on the critical ASR advances of recent years.
But one issue is that the whole conversation is around English or English-related languages, even though quite similar behaviour has been witnessed while developing for other languages. I have been involved in developing acoustic and language models (both grammar-based and statistical ones) for a couple of non-mainstream languages (e.g. Greek, Turkish, Arabic) with not-so-large training corpora available, in the past.
Another issue would be the differences in the non-USA market (particularly the European one), where "monopolies" are not as strong as the ones mentioned.
I just started using Dragon 10 for medical transcription after a less than satisfactory experience 5 years ago or so. So far, pretty impressive.

I have no beef with medical transcriptionists who point out that human transcription is more accurate, and potentially context-sensitive, if the transcriptionist is medically trained. However, in the competitive world we inhabit, transcriptionists are pressed harder and harder, quality is uneven, and it doesn't appear to me that all medical transcriptionists really have any medical training.

A large institution that can afford in-house transcription staff and that really cares about quality will almost certainly beat Dragon. For me, in a small office, dependent on sending transcription out and then dealing with the text files that come back, Dragon works better: the editing chore is about the same, and the time factor is quite a lot better.

For actual writing tasks, I find it better to use a pencil and paper than a keyboard. For getting medical notes into a chart, dictation beats writing.

Saying that speech recognition is dead because its accuracy falls far short of HAL-like levels of comprehension is like saying that aeronautical engineering is dead because commercial airplanes cannot go faster than 1,000 miles per hour, and by the way … they cannot get people to the moon.

See the rest of my response to this post at

Hi, Roberto:

Thank you for reading and your impassioned comment.

I read your blog and you write "If you think that speech recognition technology, after 50 years of so of research, would bring us HAL 9000, you are right to think it is dead."

That's what I think!

You go on to say "that type of speech recognition was never alive, except in the dreams of science-fiction writers." I agree that SF writers were big purveyors of that dream, but I think a lot of other people believed in it too, maybe most people--and that's why the death of that dream has gone unrecognized. Nobody wants to talk about it. It's pretty shocking.

What do you mean computers aren't automatically (i.e. with a lot of work by smart people like you) going to progress to understanding language?

Hard to believe.

Hi Robert ... thanks for the response to my response to your blog ... I started working in speech recognition research in 1981 ... Since then I have built speech recognizers, spoken language understanding systems, and finally those dialog systems on the phone that some people hate and techies call IVRs ... (now I don't build anything anymore because I am a manager :) ) ... but during all this time I never believed I would see a HAL-like computer in my lifetime. And I am sure the thousands of serious colleagues and researchers in speech technology around the world never believed that either. In the end we are engineers who build machines. And as we come to realize the inscrutable complexity and sophistication of human intelligence (and speech is one of its most evident manifestations), and the principles on which we base our machines, we soon understand that building something even remotely comparable to a human speaking to another human is beyond the realm of today's technology, and probably beyond the realm of the technology of the next few decades (but of course you never know ... we could not predict the Web 20 years ago ... could we?).

Speech recognition is a mechanical thing ... you get a digitized signal from a microphone, chop it into small pieces, compare the pieces to the models of speech sounds you previously stored in a computer's memory, and give each piece a "likelihood" of being part of that sound. Pieces of sounds make sounds, sounds make words, words make sentences, and you keep scoring all the hypotheses in an orderly fashion based on statistical models of larger and larger entities (sounds, words, sentences), such as models of the probability of a sound following other sounds in a word, or a word following other words in a sentence, and so on. At the end you come up with a hypothesis of what was said. And using the mathematical recipes prescribed by the engineers who worked that out, you get a correct hypothesis most of the time ... "most of the time" ... not always. If you do things right, that "most of the time" can become large ... but never 100%. There is never 100% in anything humans, or nature, make ... but sometimes you can get pretty damn close to it ... and that's what we strive for as engineers.
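The mechanical scoring described above can be sketched in a few lines: combine an acoustic score (how well the audio matches a word sequence) with a language-model score (how likely that word sequence is in text), and keep the best-scoring hypothesis. All the numbers, word sequences, and bigram probabilities below are invented for illustration; a real recognizer derives acoustic scores frame by frame from stored sound models and searches over vastly more hypotheses:

```python
import math

# Invented acoustic log-likelihoods for two competing hypotheses
# (higher is better). Note the "wrong" one sounds slightly better here.
acoustic_logprob = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}

# Toy bigram language model: log-probability of each word given its
# predecessor, as if estimated from text counts. "<s>" marks sentence start.
bigram_logprob = {
    ("<s>", "recognize"): math.log(0.001),
    ("recognize", "speech"): math.log(0.1),
    ("<s>", "wreck"): math.log(0.0001),
    ("wreck", "a"): math.log(0.05),
    ("a", "nice"): math.log(0.01),
    ("nice", "beach"): math.log(0.01),
}

def lm_score(words):
    # Sum the log-probability of each word given the previous word.
    return sum(bigram_logprob[(prev, w)]
               for prev, w in zip(("<s>",) + words, words))

def decode(hypotheses):
    # Pick the hypothesis maximizing acoustic + language-model score.
    return max(hypotheses, key=lambda h: acoustic_logprob[h] + lm_score(h))

best = decode(list(acoustic_logprob))
print(" ".join(best))  # the language model outweighs the acoustic edge
```

Here the language model rescues the decoder: "recognize speech" wins despite the marginally better acoustic fit of "wreck a nice beach", which is exactly the play-the-odds behavior the article describes.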

So, there is no human-like intelligence (God forbid HAL-like evil intelligence) in speech recognition. No intelligence in the traditional human-like sense ... (but ... what's intelligence anyway?). There is no knowledge of the world, no perception of the world, no having experienced and thought about the world for every minute of our conscious and unconscious life. Speech recognition is a machine which compares pieces of signal with models of them ... period. And doing that the "statistical" way works orders of magnitude better than doing it in a more "knowledge-based," inferential, reasoning way ... I mean doing it in an AI-ish manner ... We tried that--the AI-ish knowledge-based approach--very hard in the 1970s and 1980s, but it always failed, until the "statistical" brute-force approach started to prevail and gain popularity in the early 1980s. AI failed because the assumption on which it was based presumed you could put all the knowledge into a computer by creating rational models that explain the world ... and letting the computer reason about it. In the end it is the eternal struggle between rationalism and empiricism ... elegant rationalism (AI) lost the battle (some think the battle ... not the war) because stupid brute-force pragmatic empiricism (statistics) was cheaper and more effective ...

So, if you accept that ... i.e. if you accept that speech recognition is a mechanical thing with no pretense of HAL-like "Can't do that, Dave" conversations, you start believing that even that dumb mechanical thing can be useful. For instance, instead of asking people to push buttons on a 12-key telephone keypad, you can ask them to say things. Instead of having them push the first three letters of the movie they wanna see, you can ask them to "say the name of the movie you wanna see" (do you remember the hilarious Seinfeld episode where Kramer pretended he was an IVR system?). And why not? If you are driving your car, you can probably use that mechanical thing to enter a new destination on your navigation system without fidgeting with its touch screen. And maybe you can do the same with your iPhone or Android phone. At the base there is a belief that saying things is more natural and effective than pushing buttons on a keypad, at least in certain situations. And one thing leads to another ... technology builds on technology ... creating more and more complex things that hopefully work better and better. These are the dreams of us engineers ... not the dream of HAL (although I have to say that probably that dream unconsciously attracted us to this field). Why the disconnect between engineers' dreams and laypeople's dreams? Who knows? But, as I said, bad scientific press, bad media, movies, and bad marketing probably contributed to it, besides the collective unconscious of our species: that of building a machine that resembles us in all our manifestations (Pygmalion?).

I am not sure about your last questions. What I meant is that computers *are* automatically going to progress in language understanding. But they are doing that by following "learning recipes" prescribed by the smart people out there and digesting oodles of data (which is more and more available, and computers are good at that). The learning recipes we figured out until now brought us so far. If we don't give up in teaching and fostering speech recognition and machine learning research, one day some smart kid from some famous or less famous university somewhere in the world will figure out a smarter "recipe"... and maybe we will have a HAL-like speech recognizer .. or something closer to it...

Let's hear more from Guido Gallopyn, VP of Dragon.

Where is Dragon NaturallySpeaking going? He said, "The worst is that its conclusions are blatantly incorrect." Example?

I think Dragon is pretty good -- but what does the company see in the future? What would make it a better product? Do they aspire to HAL over at Dragon?

I don't know about speech recognition, but syntactic parsing (which has almost nothing to do with speech recognition) is improving each year, so you might want to read something newer than your 1986 or 1996 quote. I'd say by now it's around 92% (according to the measure you use).
it's easier to be ignorant and smug than do a little research I suppose.
Dear Smug & Ignorant:

The most recent figure I cite of 80% parsing accuracy comes from an email to me from Ted Briscoe, dated March 10, 2010. Briscoe is Professor of Computational Linguistics at Cambridge.

Briscoe's system is around 80%, but that does not mean the best parsing systems are at 80%. With all due respect, maybe you should ask him again.

Collins' or Charniak's parsers are above 90%.

and I'm sure they've been improved by now.

I am not able to find evidence of >90% parsing accuracy on unrestricted text or year-on-year improvement.

I checked with one of the researchers you mention. In part, the email reply reads:

"On WSJ [Wall Street Journal] we have many systems which perform at >90% accuracy on various evaluation measures.

"Please don't quote this in the article though: I would prefer if you referenced papers giving specific numbers, the full evaluation set-up etc."

I replied back: "Is it the case that accuracy in the >90% range is reached only for restricted (if quite large) domains like WSJ?"

I have received no reply.

Three systems above 90%. Will that be enough?


Bikel, D. (2004). On The Parameter Space of Generative Lexicalized Statistical Parsing Models. PhD Thesis, Computer and Information Science, University of Pennsylvania.

Collins, M. (1999). Head-driven Statistical Models for Natural Language Parsing. PhD Thesis, Computer and Information Science, University of Pennsylvania.

Charniak, E. and Johnson, M. (2005). Coarse-to-fine n-best parsing and MaxEnt discriminative reranking. Proceedings of the 43rd Annual Meeting of the ACL, pages 173–180, Ann Arbor, June 2005.

McClosky, D., Charniak, E., and Johnson, M. (2006) Effective Self-Training for Parsing. Proceedings of HLT/NAACL 2006, pages 152-159, New York City, USA, June 2006.

Petrov, S., Barrett, L., Thibaux, R., and Klein, D. (2006). Learning accurate, compact, and interpretable tree annotation. Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the ACL, pages 433–440, Sydney.

Petrov, S., and Klein, D. (2007). Improved inference for unlexicalized parsing. Proceedings of NAACL 2007, pages 404-411.

Sleator, D. and Temperley, D. (1993). Parsing English with a Link Grammar. Third International Workshop on Parsing Technologies.

The three parsers you refer to achieved >90% scores on the Wall Street Journal corpus, section 23, evidently.

The Wall Street Journal is not unrestricted text.

Define unrestricted, then. If you mean spoken English, then:

1. it's a somewhat different problem
2. humans can't always do it (so the target is lower anyway)

If you mean written text, just add whatever you have in mind to the training corpus, or read papers written after '96. But the way this is going, I'm expecting more bad faith to fit your line of argumentation ("NLP does not work, and will never work; and nothing has changed since the '60s, you lazy academic bums!"). It's probably a better read than a qualified analysis, I'll give you that. Too bad it's only loosely related to reality. Blog away, then.

It is shocking to read the claim that accuracy has not improved since 2001. It has improved tremendously! Nuance's DNS is much more accurate today than it was just a few years ago. Mr. Robert Fortner's essay seems incredibly misconceived.
Here is the bottom line -- if you want to believe that accuracy has not improved over nine years, go ahead and believe it. Meanwhile, those of us who know that it has improved markedly will use this fantastic program, Dragon NaturallySpeaking, which I'm using to type this very message by voice without touching the keyboard once.

I have a friend whose 80-year-old father is writing his memoirs. He writes his book on yellow legal pads, then dictates into an old-fashioned dictating machine which uses a cassette. He then sends the tape to a service that transcribes it for him by typing the words into a computer.

When my friend told me about this, I almost fell over laughing. In addition to saving a ton of time and hassle, he could cut out the yellow pads (by the way, he claims to be an environmentalist), the cassettes, and paying the transcription service. I have tried to tell him many times about the wonders of voice dictation, but he just won't listen.

But then again, this self-professed environmentalist refuses to use a digital camera as well, so he still takes pictures with a film camera and has the pictures developed.


Voice to text conversion could benefit from some serious competition. Dragon Naturally Speaking is really very good. But it's proprietary and nearly a monopoly. All monopolies deny they stifle innovation (at least since James Watt opposed the development of high pressure steam engines in the early nineteenth century), but almost by definition their upgrades cannibalise their own user base (as has happened with Microsoft products). Open source development provides one solution by encouraging innovation without destroying profitability and by making new releases part of the business model. There are now few proprietary applications that don't have a workable open source alternative. Unfortunately speech recognition seems to be one of them. Does anyone have an opinion on the prospects for open source voice to text conversion? Perhaps the development of Android makes this more likely.
One of my only complaints regarding DNS (apart from the monopoly) is that it requires Windows. There seem to be no plans to port NaturallySpeaking to Linux, but in the meantime it works using Wine [].
fourcultures: There may be no open source speech recognition out there, but if you're just looking for high-speed text input, speech recognition might not be what you need anyway. Stenography can offer transcription at up to 300 words per minute, without any of the frustrating black box/fuzzy logic/non sequitur problems that even the best speech recognition engines are rife with. Plover, the Open Source Steno Program, is in constant development, and is currently compatible with a $45 qwerty keyboard that can be reversibly refitted to work as a steno machine. I'm also in the process of writing a free Steno 101 series for people who want to teach themselves high-speed text input on their own. For more information, read the "What Is Steno Good For?" series on stenographic applications for mobile/wearable computing, accessible technology, and much more:
I don't think of Dragon NaturallySpeaking as a monopoly. To me, a monopoly is a company that forces you to buy their operating system with every computer that you buy. Nobody's forcing anybody to buy speech dictation software. So many people on this thread are complaining about how awful and unworkable speech dictation is (and they are dead wrong) - well, nobody is putting a gun to your head to buy it.

It seems strange to me to want to go open source to save money and then pay $45 for a QWERTY keyboard and teach yourself stenography for faster input of text, when voice dictation is clearly the solution, and much, much faster than stenography. To me that's like trying to find a better horse and buggy while cars are blowing by you on the freeway (but to make the analogy complete, it's like people driving horses and buggies and not really believing cars exist, since clearly most people don't believe that voice dictation works). Right now I'm typing this using version 10 and not even having to touch the keyboard. Why is that so difficult for people to believe?

And Dragon NaturallySpeaking is available for $36 here:

and the version for teens is only $19 on Amazon.

And if you think that Dragon will stifle innovation, think again -- read this excellent review of version 11 by David Pogue in the New York Times. (By the way, I don't know what he's talking about when he refers to Skype -- Dragon works perfectly for me on Skype chat).

To be quite honest with you, I don't care what company purchased which company, or who litigated against whom, what company changed their name, or any of that. All I care about is that this voice dictation program works, and it works extremely well. I just wish people would try it before vilifying it.

voice dictation is clearly the solution, and much, much faster than stenography.

Where on earth did you get that idea? Voice dictation isn't even a little bit faster than stenography. For most users, voice recognition tops out in the 160-180 WPM range. Some very, very skilled users who've spent thousands of hours training their software and their own voices might get up to the 220-240 range. But stenographers frequently reach speeds of 260 to 280 words per minute, and the Guinness World Record for text input was 360 words per minute at 97.22% accuracy, set by Mark Kislingbury in 2004.

Someone please explain "stenography" to me. I think of it as what court reporters do -- madly trying to take down everything everyone says, so speed is important. I don't think of it as something an author might do -- getting his own thoughts into type. The latter doesn't really require extreme speed, but it has to be something that is natural, like speaking, unless one can afford to hire a stenographer.

After 3 months of off and on work with a version of DNS, I can now get my medical records completed faster, and much more legibly, than I used to do by writing, and even having to go back and correct a few errors, it is faster than I can do by keyboard -- even though I am fairly proficient at typing.

Seems like this argument, if you can call it that, amounts to several people giving monologues--and talking past each other. In reality, there are several tools for getting speech into printed words, and different ones work in different environments.

I can't afford to hire a stenographer -- even regular transcription is beyond me to afford at medicare and welfare reimbursement rates -- and I don't have any need of the speed of stenography, nor any interest in learning it myself.

On the other hand, no one is arguing that DNS can take dictation from multiple people, especially at the same time, or that it works in noisy environments, and I have no doubt that stenography is faster than voice recognition and subsequent correction.

Tomastoria: I was only addressing Dan Nainan's claim that voice recognition was faster than stenography. You're right that speed is not always essential for all tasks (though I, personally, much prefer writing silently to dictating, and I've found that steno helps me compose fiction and other prose with astonishing fluency). And dictation, especially medical dictation, which contains a lot of hard-to-spell words, is one of the best reasons to use voice recognition software like DNS. It's got many other uses, including helping people who can speak but can't type use their computers, and in situations like navigation where hands-free dictation is important. My argument is that, one, DNS's current ability to accurately take slow, measured dictation from a trained voice does not mean that speech recognition software will eventually transcribe multiple voices accurately in natural language settings with imperfect audio, including punctuation and error correction. I'm a stenographer for the Deaf and hard of hearing, and those are the conditions that I work in. I just get a little sick of people saying that my job is going to quickly become obsolete, despite decades of evidence to the contrary. (I speak about that in more detail here: ) Dictation is one thing. Speaker-independent natural language processing in real-world conditions is a very different thing, and people who argue that software will inevitably make the leap from the first circumstance to the second are not understanding all the variables involved.

My other argument, though, is that for people who are currently typing using qwerty and who don't necessarily want to use their voice to do their job -- whether they have an accent or prefer thinking with their fingers or are in situations where they don't feel comfortable speaking aloud all the time -- steno is an excellent alternative, and there's now an open source option for them to use.

Here's a video demonstrating the difference between steno and qwerty:

If you like, you can also turn on the auto-captions and see what YouTube's speech processing made of the dictation, but that's pretty much beside the point. I just think that if more people used steno instead of qwerty, they'd be able to get stuff done more easily and more efficiently, whether at 100 or 300 WPM. But for someone like you, tomastoria, it seems like DNS is a great solution to your needs, and I'm not criticizing it at all. There's enough room in this world for multiple input systems, and when voice recognition works well, it's a great good thing.

Thanks, Mirabai.
I agree with everything you say.

I, too, prefer to write, but my records now have to be in typescript of some sort; more and more people have access to the medical records (in this era of "privacy", no less!) and they are angry about receiving copies of what to them are illegible notes.

The other option for medical records is much worse -- point and click mouse selection of stock phrases, producing a sort of boiler-plate record that is meaningless, but seems to satisfy the government and insurance company bureaucrats. There are a lot of records like that out there, and they are beyond awful.

QWERTY is too slow for me, despite my being fairly proficient, and my hands and fingers get tired.

The YouTube captions you posted were ludicrous -- but I think whatever speech recognition program they were using is quite inept. DNS produces some howlers, but that was ridiculous. DNS in a controlled environment (my home or office) with one speaker, speaking a fairly consistent vocabulary and a lot of repetitive phrases at a relatively slow rate works well enough. No doubt steno would be an improvement -- especially in accuracy, since I don't need more speed than I get with the current setup.

But for multiple speakers in a classroom or courtroom or lecture hall, there can be no substitute for a human intelligence behind a steno machine if the idea is to get words into type.

I always wondered how they did it in the old days -- it couldn't have been a verbatim transcript, but trained stenographers must have existed long before steno machines.

I don't believe stenography is going away any time soon as a profession. I do hold that computer voice recognition works well enough in some situations to solve the problems that the existence of computers caused.

My apologies for misstating the speed of stenography versus voice dictation; I stand corrected. However, I simply cannot understand the perception that Dragon takes inordinate time for training. I've trained many machines, and it takes perhaps six or seven minutes, then you're off and running. The new version 11, which I can't wait for, apparently can do away with training altogether. So this is a step in the right direction towards recognizing multiple voices - nothing like that is going to happen overnight, but then again, 30 years ago, nobody could have anticipated the level of speech recognition we have now. I wouldn't bet against technology.

Stenography certainly has its place, but I think the learning curve is too difficult, and besides, what am I going to do, carry around a steno unit with me everywhere with my laptop? And everyone seems to be moaning and groaning about the time Dragon takes to train (again, six or seven minutes), but how long would it take to get proficient at stenography? A year? The irony of all this is that people already know how to speak, whereas they have to learn how to use a steno machine.

Many times I hear the objection that speaking to the computer is not natural, and typing is, and that voice dictation would be too hard to get used to. I strongly beg to differ. The hands are a superfluous middleman between the brain and the computer...typing just gets in the way. In the year 2010, everyone is still using a suboptimal keyboard layout that was originally designed to *slow down* typists.

Dragon does NOT require slow, measured dictation with repetitive phrases. I'm speaking to it right now, using normal conversational speed, and it's typing everything I'm saying accurately. Certainly I don't claim that it's 100% perfect, but it's certainly 100 times better than having to type everything.

Here's another example of the misperception about voice dictation. In an otherwise fantastic article about how Star Trek correctly predicted a lot of technology that we have today, one of the people interviewed said the following:

Michael also noted that voice input is generally inefficient. "Imagine I'm looking at some photos, and I want to say, 'Up, up, left, down one, photo number 3362, no, the one on the left.'—that's much slower than just clicking or tapping," he said. "Natural language is, I think, going to have some significant limitations."

And yet, Dragon has a fantastic solution -- it numbers all the pictures on the page, so all you have to do is say the number -- which is faster than using the mouse!

Here's the link to the article:

Suppose I wrote a post stating that digital photo retouching hasn't made any progress and that everything still needs to be airbrushed by hand. You would call me a nut case, and you would point to Adobe Photoshop as proof that digital photo editing works, and has worked for years. Then suppose I responded with a bunch of untruths about how the software doesn't work correctly, and on top of that went on to say how evil Photoshop is because it's a monopoly, then detailed how Adobe bought it from John Knoll, and how the name was changed from Display to Image Pro to Photoshop... I see no difference between that argument and this debate here.

I have lost all faith in speech recognition software. Not in the phone and definitely not in my computer. It started out fine... I had high hopes, but the sheer fact is that voice recognition is not entirely accurate AND it has a very difficult time with accents. Now, being ethnically Russian, I do have an accent, but it's not as heavy as what you see in the movies. I have a mix of Russian, British, New Jersey and Florida accents, and lo and behold, my cell phone voice dialing can't understand simple words like "Mama" or "Mom" or "Alex" or "Josh" or "Work" or "One" or "No"... no matter what I do or how I say it. My laptop with Vista and that voice engine is even worse, having an accuracy rate of maybe 30% at best. I type really fast, so for me it is easy to just type it out myself instead of waiting for the computer to figure out what it thinks I am trying to say. Now you might say that it may be me... and it probably is in part... but then how do you explain the fact that when I use speech commands in video games, the game recognizes me with no problems?! So like I said, I have lost all faith in speech recognition.
Finding 1: Speech recognition works very well for some users when performing some tasks.
Finding 2: Other people find it not to be of much use at all, for any purpose.
Conclusions: The given data contains one or more contradictions. As a result, nothing can be deduced.

Meta Conclusions: Humans are not intelligent, and probably never will be.
Their major development has reached a plateau.

Meta meta conclusion: It is not wise to purchase a human now.
Suggestion: Wait for the next version; it seems pretty accurate in their demo.
Pfffft. When Nuance brought DD out on Mac I became a user for the first time, having avoided it previously because of views similar to yours. Now, just a few months later, I'm using it for all my computing tasks except for writing code (I'm a software developer) and getting 99%+ accuracy. Good results depend on having a good microphone and good dictation. OK, so I'm learning to speak more clearly, having to adapt to my computer to get it to do what I want. Well, the keyboard and mouse also force me to adapt to my technology. With speech, it's the computer coming to me rather than the other way around.

What a waste ... let's go back to caves and clubs, creating new tech is just twooooo hawrd for us.

Having read the article and some of the detailed rebuttals in the comments I'm not convinced. That said I haven't used speech recognition in years.
If there's anybody out there still living in the dark ages who doesn't believe voice dictation works, watch this: in the first video, I'm dictating on an airplane (the four errors are because I'm talking softly so as not to disturb my fellow passengers).

In the second video, I dictate into an old Treo, remove the SD card and have Dragon transcribe what I said.

Anybody still believe voice dictation doesn't work?

I would love to see the results of a test like the following: pick a random person on a street of some random town in the US. You will equip that person with the best listening equipment available, including custom-fitted and carefully adjusted headphones, user-adjustable sound filtering, or anything along those lines in existing technology. You may even hire an audiologist to compensate electronically for any weaknesses in the subject's hearing.

Now feed into the subject's audio system some of the real-world instances of speech that people would like a computer to understand, with exactly as much or as little context as the computer gets. The selected human is supposed to type the words on a keyboard.

I would bet that a typical test subject, selected and equipped in that way, would score about 50% accuracy at the beginning, and then slowly improve over the next several weeks and months, assuming the person continued doing that job for several hours per day.

A significant fraction of people living in the US can neither speak, read nor write their own native language with a high degree of accuracy or comprehension. The average level is probably similar in most countries of the world, and maybe a bit better in some European countries or maybe in Japan.

After the test, you will be free to choose between hiring that human subject to do the necessary work, or buying a computer VR system. I think I know what the outcome will be.

We tend to suppose that because a computer can do jobs like arithmetic faster and better than an average person, computer speech recognition will turn out to be either easy or effectively impossible.

That is not the case. Computers and software have improved to the point that they can often do an acceptable job at the extraordinarily difficult task of understanding what humans are trying to say.

The quality of speech recognition in commercial products 15 years ago was indeed "horseshit", if judged by my memory of commercial products at the time. However, the quality is very much better now. Products out of the box give excellent results. As proof of this, I can use audio to dictate a letter on my computer, although it has never been trained on my voice. This article is rubbish (I accept that it may have been accurate in 2002, or even a few years later) and should be revised or withdrawn. Or is it being presented as disinformation, and if so, for what reason?
I just posted a comment disagreeing with the author. However, I do agree that ASR and TTS may have a very long way to go before they can compete with the language capabilities of a human two- or three-year-old. Humans do not need to be supplied with "phonetic models" and "language models", or such barbarisms as "manually labelled speech", in order to work effectively.