
New approach promises more accurate speech recognition software

By Dario Borghino

August 24, 2012


Research from NTNU is attempting to recognize human speech more accurately by detecting how we use our vocal tract to produce sound (Image: Shutterstock)

Researchers at the Norwegian University of Science and Technology (NTNU) are combining two of the best-known approaches to automatic speech recognition to build a better, language-independent speech-to-text algorithm. The system can identify the language being spoken in under a minute, could help transcribe languages on the brink of extinction, and brings the dream of ever-present voice-controlled electronics just a little bit closer.

The exponential yearly improvements in processing power we are seeing give hope that we are quickly moving toward superbly accurate and responsive speech recognition – and yet, things aren't quite that simple. Even though this technology is slowly making its way into our phones, tablets and personal computers, it'll still be some time before keyboards disappear from our digital lives altogether.

Achieving accurate, real-time speech recognition is no easy feat. Even assuming that the sound acquired by a device can be completely stripped of background noise (which isn't always the case), there is hardly a one-to-one correspondence between the waveform detected by a microphone and the phoneme being spoken. Different people speak the same language with different nuances – accents, lisps and other articulation defects. Other factors such as age, gender, health and education also play a big role in altering the sound that reaches the microphone.

In other words, faster processors alone are useless, because we also need a robust plan of action to use all that number-crunching power the right way – with efficient, reliable computer algorithms that can figure out how to see through the incredible variety of sounds that can come out of our mouths and accurately transcribe what we are saying.

The NTNU researchers are now pioneering an approach that, if it can be fully exploited, may lead to a big leap in the performance of speech-to-text applications. They demonstrated that the mechanics of human speech are fundamentally the same across all people and across all languages, and they are now training a computer to analyze the pressure of sound waves captured by the microphone to determine which parts of the speech organs were used to produce a phoneme.

Much of the most successful speech recognition software available today asks users to provide personal information about themselves, including age group and accent, before it even attempts to transcribe human speech for the first time. When creating a new profile, users are also often asked to read some text to first calibrate the software's parameters.

This is because speech recognition software often uses data fed by users to continuously improve its accuracy. It typically relies on probabilistic tools – namely, Bayesian inference – to estimate the probability that a certain sound was spoken given the user's speech patterns that it has learned over time. This means the quality of the transcripts can improve noticeably after the program has collected a critical amount of data on the user. On the flip side, speech recognition may not be very accurate right after a new user profile has been created.
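As a rough illustration of the idea (the phoneme labels and numbers below are invented for the example, not taken from the NTNU work), Bayes' rule combines a prior learned from the user's history with the acoustic evidence in the current sound:

```python
# Toy Bayesian update: P(phoneme | sound) is proportional to
# P(sound | phoneme) * P(phoneme)
def bayes_update(priors, likelihoods):
    """priors: P(phoneme) from user history; likelihoods: P(sound | phoneme)."""
    unnormalized = {p: priors[p] * likelihoods[p] for p in priors}
    total = sum(unnormalized.values())
    return {p: v / total for p, v in unnormalized.items()}

# Invented numbers: this user's history makes /s/ more likely than /th/,
# while the acoustic evidence alone slightly favors /th/.
priors = {"s": 0.7, "th": 0.3}
likelihoods = {"s": 0.4, "th": 0.6}
posterior = bayes_update(priors, likelihoods)
```

Here the accumulated prior outweighs the weak acoustic evidence, so /s/ still wins – which is exactly why transcription accuracy improves as the profile gathers data.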

An alternative to the statistical approach described above is to have humans study sounds, words and sentence structure for a given language and deduce rules which are then implemented into the software. For instance, different phonemes show different resonant frequencies, and the typical ranges for these frequencies can be programmed into the software to help it detect the sound more accurately.
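A minimal sketch of such a rule table, with made-up formant-frequency ranges (real values vary considerably by speaker and language):

```python
# Hypothetical (F1, F2) resonant-frequency ranges in Hz for a few vowels.
VOWEL_FORMANTS = {
    "i": ((200, 400), (2000, 3000)),   # roughly as in "beet"
    "a": ((700, 1100), (1100, 1700)),  # roughly as in "father"
    "u": ((250, 450), (600, 1100)),    # roughly as in "boot"
}

def match_vowel(f1, f2):
    """Return the vowels whose programmed ranges contain the measured formants."""
    return [v for v, ((lo1, hi1), (lo2, hi2)) in VOWEL_FORMANTS.items()
            if lo1 <= f1 <= hi1 and lo2 <= f2 <= hi2]
```

A measured frame with F1 near 300 Hz and F2 near 2,500 Hz would match only "i", narrowing the search before any statistics are consulted.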

The system developed at NTNU is a blend of the two approaches: it collects data to learn about the user's speech nuances and improve accuracy over time but, crucially, it also incorporates a rule-based approach that is based on phonetics – the study of the sounds of human speech.

Detecting the pressure of sound waves on the microphone could mean achieving higher accuracy than was previously possible. As an example, sounds can be classified as voiced (in which vocal cords vibrate) and voiceless (in which they do not). The analysis of the pressure of sound waves on the microphone can detect the vibration of the vocal cords directly rather than deducing it from the peak frequencies captured by the microphone.
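One cheap stand-in for such an analysis (a common textbook heuristic, not the NTNU method) is the zero-crossing rate: frames driven by vibrating vocal cords are quasi-periodic and cross zero rarely, while voiceless fricatives look noise-like and cross constantly:

```python
import numpy as np

def is_voiced(frame, zcr_threshold=0.1):
    """Crude voicing detector: low zero-crossing rate suggests vocal-cord vibration."""
    signs = np.sign(frame)
    zcr = np.mean(signs[:-1] != signs[1:])
    return bool(zcr < zcr_threshold)

sr = 16000
t = np.arange(0, 0.02, 1 / sr)                 # one 20 ms frame at 16 kHz
voiced_frame = np.sin(2 * np.pi * 120 * t)     # 120 Hz tone, vowel-like
voiceless_frame = np.random.default_rng(0).standard_normal(t.size)  # noise, /s/-like
```

A 120 Hz tone crosses zero only a handful of times in 20 ms, while white noise flips sign on roughly half of all samples, so the two frames separate cleanly.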

Because the anatomy of speech is the same across all humans, one of the strengths of the system is that it is completely language-independent. Therefore, unlike previous approaches, it can be easily adapted to a new language without much work at all, opening the door to languages spoken by minority groups for which commercial speech-to-text software isn't a viable proposition.

The team is now looking to develop a language-independent module that it can use to design competitive speech recognition products. Such software could also excel at transcribing speech in more than one language since, the researchers say, it takes the system only 30 to 60 seconds to identify which language is being spoken.

Source: Research Council of Norway

About the Author
Dario Borghino Dario studied software engineering at the Polytechnic University of Turin. When he isn't writing for Gizmag he is usually traveling the world on a whim, working on an AI-guided automated trading system, or chasing his dream to become the next European thumbwrestling champion.   All articles by Dario Borghino
7 Comments

I don't believe this approach would lead to something great.

Actually, humans can recognize speech better because they UNDERSTAND the meaning of the phrase.

So they expect some words to appear and some not.

This helps to understand speech even in noisy conditions.

Pavel Chernov
26th August, 2012 @ 06:24 am PDT

The article uses the phrase "voice recognition" incorrectly. Voice recognition concerns WHO is speaking. "Speech recognition" is the term for identifying WHAT is being said.

- Thanks for pointing this out. The story has been corrected. - Ed.

piperTom
26th August, 2012 @ 07:00 am PDT

@Pavel valid point. Oftentimes when I don't hear someone I find myself trying to match the things they may have said against the garbled sound segment I heard until I have something that fits the context. If it is within the context of the previous part of the conversation it is usually easier to decipher than if the subject changed to something else. I suspect speech recognition could tap the conversation history of the application being used to do the same without a lot of difficulty.

Daishi
26th August, 2012 @ 08:26 pm PDT

"Detecting the pressure of sound waves on the microphone could mean achieving higher accuracy than was previously possible."

-- that is the definition of a microphone, though...

"As an example, sounds can be classified as voiced (in which vocal cords vibrate) and voiceless (in which they do not)."

-- ok, but that is not any different from what is done now

"The analysis of the pressure of sound waves on the microphone can detect the vibration of the vocal cords directly rather than deducing it from the peak frequencies captured by the microphone."

-- meaningless

wle
27th August, 2012 @ 10:06 am PDT

@Daishi agreed.

Some software on mobile devices attempts to use likely-phrase matching to better guess the word you just mis-typed, leading to less acceptance of 'monkey penguins' as a likely phrase – although the patterns may match up individually, as a phrase it (almost?) never appears.

A method that analyses a phrase based on the best-matched word in that phrase, followed by the second most likely word that has a high phrase-match value, would not necessarily produce what you said (if you said "monkey penguins") but would have a higher probability of producing what you said if you, in fact, said "emperor penguins" (as 'emperor penguins' appears far more often as part of a phrase than 'monkey penguins').

gee, I hope that makes sense. :-)
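- It does. The phrase-likelihood idea sketches neatly in Python; the bigram counts below are invented stand-ins for what a real language model would learn from a large text corpus:

```python
# Invented bigram counts standing in for a corpus-trained language model.
BIGRAM_COUNTS = {
    ("emperor", "penguins"): 950,
    ("monkey", "penguins"): 1,
}

def pick_phrase(candidates, counts=BIGRAM_COUNTS):
    """Choose the candidate word pair the language model has seen most often."""
    return max(candidates, key=lambda pair: counts.get(pair, 0))

best = pick_phrase([("monkey", "penguins"), ("emperor", "penguins")])
```

Even if the acoustics slightly favor "monkey", the phrase statistics pull the decoder toward "emperor penguins". - Ed.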

MockingBird TheWizard
27th August, 2012 @ 11:35 am PDT

My Galaxy S3 is amazing at V.R. Anyone can speak to it (except young children) and it does a very good job of it.

As for background noise, it occurs to me that having 2 microphones could be used to determine the direction of the sound (as our ears do). If it's not on axis, ignore it.

People who are deaf in one ear have enormously greater difficulty understanding speech in a noisy environment than those with normal hearing.
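- The two-microphone idea is a classic time-difference-of-arrival estimate. A minimal sketch (the mic spacing and delays here are illustrative, not from any particular handset):

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def arrival_angle(delay_s, mic_spacing_m):
    """Angle of arrival (radians from broadside): sin(theta) = c * delay / spacing."""
    s = SPEED_OF_SOUND * delay_s / mic_spacing_m
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.asin(s)

# A source straight ahead reaches both mics simultaneously (zero delay, zero angle);
# a source fully off to one side is delayed by spacing / c (90 degrees).
straight_ahead = arrival_angle(0.0, 0.1)
```

The device could then keep only sound arriving near zero degrees and ignore the rest, much as you describe. - Ed.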

warren52nz
27th August, 2012 @ 02:04 pm PDT

I TRY to use speech recognition software at work to compensate for some physical challenges (other than spelling), and it is a pain. When I had a good current user database it was pretty good... but that was lost in an "equipment refresh" at work. I am hoping for a processor upgrade for the most current solution, but this looks promising?

Kevin E. James
27th August, 2012 @ 03:02 pm PDT