How Do Computers Understand Speech?

More and more, we can get computers to do things for us by talking to them. A computer can call your mother when you tell it to, find you a pizza place when you ask for one, or write out an email that you dictate. Sometimes the computer gets it wrong, but a lot of the time it gets it right, which is amazing when you think about what it has to do: turn tiny changes in air pressure into written words. Computer speech recognition is very complicated and has a long history of development, but here, condensed for you, are the 7 basic things a computer has to do to understand speech.

1. Turn the movement of air molecules into numbers.

Sound comes into your ear or a microphone as changes in air pressure, a continuous sound wave. The computer records a measurement of that wave at one point in time, stores it, and then measures it again. If it waits too long between measurements, it will miss important changes in the wave. To get a good approximation of a speech wave, it has to take a measurement at least 8,000 times a second (about telephone quality), but it works better if it takes one 44,100 times a second (the rate used on audio CDs). This process is known as sampling, or digitization, at 8 kHz or 44.1 kHz.
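
To make the idea of repeated measurement concrete, here is a minimal Python sketch. It assumes the "sound" is a pure 440 Hz tone computed from a formula rather than picked up by a microphone, and the function name sample_wave is made up for this illustration.

```python
import math

def sample_wave(freq_hz, sample_rate_hz, duration_ms):
    """Measure a pure sine tone at evenly spaced instants in time."""
    n_samples = sample_rate_hz * duration_ms // 1000
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz)
            for n in range(n_samples)]

# Ten milliseconds of the same tone at the two rates mentioned above.
low = sample_wave(440, 8000, 10)
high = sample_wave(440, 44100, 10)
print(len(low), len(high))  # 80 441
```

The same ten milliseconds of sound turns into 80 numbers at 8 kHz and 441 numbers at 44.1 kHz, so the higher rate captures the shape of the wave in finer detail.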

2. Figure out which parts of the sound wave are speech.

When the computer takes measurements of air pressure changes, it doesn't know which ones are caused by speech, and which are caused by passing cars, rustling fabric, or the hum of hard drives. It performs a variety of mathematical operations on the digitized sound wave to filter out the stuff that doesn't look like what we expect from speech. We have a rough idea of what to expect from speech, but not a precise enough one to make separating out the noise an easy task.
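
One of the simplest filters of this kind, shown here only as an illustration and not as what any particular product does, is to chop the recording into short frames and keep the ones that are loud enough. The sketch below assumes made-up sample values and a hand-picked threshold; real systems use far more careful tests.

```python
def frame_energies(samples, frame_len):
    """Average squared amplitude of each consecutive, non-overlapping frame."""
    return [sum(s * s for s in samples[i:i + frame_len]) / frame_len
            for i in range(0, len(samples) - frame_len + 1, frame_len)]

def speech_frames(samples, frame_len, threshold):
    """Flag frames whose energy is above a threshold (assumed, hand-picked)."""
    return [e > threshold for e in frame_energies(samples, frame_len)]

quiet = [0.01, -0.01] * 40  # 80 small samples standing in for background hum
loud = [0.5, -0.5] * 40     # 80 large samples standing in for speech
flags = speech_frames(quiet + loud + quiet, frame_len=80, threshold=0.01)
print(flags)  # [False, True, False]
```

A loudness test like this would happily mistake a passing truck for speech, which is part of why this step is harder than it sounds.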

3. Pick out the parts of the sound wave that help tell speech sounds apart.

A sound wave from speech is actually a very complex mix of multiple waves at different frequencies. Which frequencies are present, how they change over time, and how strongly each one comes through all matter a lot in telling the difference between, say, an "ah" sound and an "ee" sound. More mathematical operations transform the complex wave into a numerical representation of the important features.
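
A standard tool for pulling a mixed wave apart into its component frequencies is the Fourier transform. The sketch below is a deliberately slow, textbook version in Python, applied to a made-up mix of two tones; the 300 Hz and 2300 Hz values are illustrative assumptions, and real recognizers go on to compute more refined features than this.

```python
import cmath
import math

def magnitude_spectrum(samples):
    """Naive discrete Fourier transform; returns magnitudes for bins 0..N/2."""
    n = len(samples)
    return [abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                    for t, s in enumerate(samples)))
            for k in range(n // 2 + 1)]

sample_rate = 8000
n = 80  # 10 ms of audio, so each frequency bin is 8000 / 80 = 100 Hz wide
# Two tones mixed together: a strong one at 300 Hz and a weaker one at 2300 Hz.
wave = [math.sin(2 * math.pi * 300 * t / sample_rate)
        + 0.5 * math.sin(2 * math.pi * 2300 * t / sample_rate)
        for t in range(n)]
spectrum = magnitude_spectrum(wave)
strongest = sorted(range(len(spectrum)), key=lambda k: spectrum[k], reverse=True)[:2]
print([k * sample_rate // n for k in strongest])  # [300, 2300]
```

From nothing but the list of measurements, the transform recovers both which frequencies went into the mix and which one was stronger.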

4. Look at small chunks of the digitized sound one after the other and guess what speech sound each chunk shows.

There are about 40 speech sounds, or phonemes, in English. The computer has a general idea of what each of them should look like because it has been trained on many recorded examples. But the characteristics of these phonemes don't just vary with different speaker accents. They also change depending on the phonemes next to them: the 't' in "star" looks different from the 't' in "city." The computer must have a model of each phoneme in many different contexts for it to make a good guess.
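
As a toy picture of what guessing from context-specific models might look like, the sketch below compares a chunk against a handful of stored examples and picks the closest one. Everything in it is an assumption made for illustration: the two-number feature vectors, the context labels, and the nearest-match rule, which stands in for the statistical models real systems learn from recorded speech.

```python
import math

# Made-up two-number feature vectors, one per phoneme-in-context.
TEMPLATES = {
    ("t", "after s"): (0.2, 0.9),
    ("t", "between vowels"): (0.6, 0.4),
    ("ah", "any"): (0.9, 0.1),
}

def guess_phoneme(chunk):
    """Return the (phoneme, context) label whose template is closest to the chunk."""
    return min(TEMPLATES, key=lambda label: math.dist(chunk, TEMPLATES[label]))

print(guess_phoneme((0.25, 0.8)))  # ('t', 'after s')
```

Keeping separate entries for the same phoneme in different surroundings is what lets the two kinds of 't' both come out as 't'.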

5. Guess possible words that could be made up of those phonemes.

The computer has a big list of words that includes the different ways they can be pronounced. It makes guesses about what words are being spoken by splitting up the string of phonemes into strings of permissible words. If it sees the sequence "hang ten," it shouldn't split it into "hey, ngten!" because "ngten" won't find a good match in the dictionary.
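
The splitting step can be sketched in a few lines of Python. The dictionary here is tiny and hypothetical, written with ARPAbet-style phoneme symbols, and the function simply tries every possible split and keeps the ones in which every piece is a known word.

```python
# Tiny, hypothetical pronunciation dictionary (ARPAbet-style symbols).
LEXICON = {
    ("HH", "AE", "NG"): "hang",
    ("T", "EH", "N"): "ten",
    ("HH", "AE"): "ha",
    ("EH", "N"): "en",
}

def segmentations(phonemes):
    """Return every way to split the phoneme sequence into dictionary words."""
    if not phonemes:
        return [[]]
    results = []
    for end in range(1, len(phonemes) + 1):
        word = LEXICON.get(tuple(phonemes[:end]))
        if word is not None:
            for rest in segmentations(phonemes[end:]):
                results.append([word] + rest)
    return results

print(segmentations(["HH", "AE", "NG", "T", "EH", "N"]))  # [['hang', 'ten']]
```

Starting with "ha" leaves a leftover that begins with "NG" and matches nothing in the dictionary, so that split is thrown out and only "hang ten" survives.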

6. Determine the most likely sequence of words based on how people actually talk.

There are no word breaks in the speech stream. The computer has to figure out where to put them by finding strings of phonemes that match valid words. There can be multiple guesses about what English words make up the speech stream, but not all of them will make good sequences of words. "Water gaslight four brick vast?" could be just as good a guess as "What do cats like for breakfast?" if words are the only consideration. The computer applies models of how likely one word is to follow another in order to determine which word string is the best guess. Some systems also take into account other information, like dependencies between words that are not next to each other. But the more information you want to use, the more processing power you need.
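
A model of how likely one word is to follow another can be as simple as a table of word pairs. The sketch below assumes invented probabilities (a real model estimates them from large collections of text) and an assumed small fallback value for pairs it has never seen, then scores each candidate sentence by adding up the logarithms of its pair probabilities.

```python
import math

# Made-up probabilities that the second word follows the first.
BIGRAM_PROB = {
    ("what", "do"): 0.2, ("do", "cats"): 0.05, ("cats", "like"): 0.1,
    ("like", "for"): 0.05, ("for", "breakfast"): 0.1,
}
UNSEEN_PROB = 1e-6  # assumed small probability for word pairs not in the table

def log_score(words):
    """Sum of log probabilities over each pair of adjacent words."""
    return sum(math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
               for pair in zip(words, words[1:]))

candidates = [
    "what do cats like for breakfast".split(),
    "water gaslight four brick vast".split(),
]
best = max(candidates, key=log_score)
print(" ".join(best))  # what do cats like for breakfast
```

Both candidates are made of real words, but only one is made of word pairs that people actually say, and that is the one with the higher score.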

7. Take action.

Once the computer has decided which guesses to go with, it can take action. In the case of dictation software, it will print the guess to the screen. In the case of a customer service phone line, it will try to match the guess to one of its pre-set menu items. In the case of Siri, it will make a call, look up something on the Internet, or try to come up with an answer to match the guess. As anyone who has used speech recognition software knows, mistakes happen. All the complicated statistics and mathematical transformations might not prevent "recognize speech" from coming out as "wreck a nice beach," but for a computer to pluck either one of those phrases out of the air is still pretty incredible.