Google’s DeepMind artificial intelligence unit has produced what could be some of the most realistic-sounding machine speech yet. WaveNet, as the system is called, learns from recordings of real human speech and models the raw audio waveform directly, predicting each new piece of audio from the audio it has already generated. In Google’s tests, both English and Mandarin Chinese listeners found WaveNet more realistic than other types of text-to-speech programs, although it was less convincing than actual human speech. If that weren’t enough, it can also play the piano rather well.
Text-to-speech programs are increasingly important for computing, as people begin to rely on bots and AI personal assistants like Apple’s Siri, Microsoft’s Cortana, Amazon’s Alexa, and the Google Assistant. If you ask Siri or Cortana a question, though, they’ll reply with actual recordings of a human voice, rearranged and combined in small pieces. This is called concatenative text-to-speech, and as one expert puts it, it’s a little like a ransom note. The results are often fairly realistic, but as Google writes, producing a new audio persona or tone of voice requires having an actor record every possible sound in a database.
The alternative is parametric text-to-speech: building a completely computer-generated voice, using coded rules based on grammar or mouth sounds. Parametric voices don’t need source material to produce words. But the results, at least in English, are often stilted and robotic.
Google’s system is still based on real voice input. But instead of chopping up recordings, it learns from them, then independently creates its own sounds in a variety of voices.
That isn’t the whole story, though. On its own, WaveNet only knows a language’s sounds, not its content. Fire it up, and it produces pleasing yet mysterious nonsense, complete with human-like pauses and breath sounds. For meaningful speech, Google guides the model with linguistic information derived from the text it is supposed to say.
But the system itself isn’t specifically tied to speech. It can learn, for example, from piano music, too.
Granted, there’s already plenty of generative music, and producing it is not nearly as complicated as making speech that humans will recognize as their own. On a scale from 1 (not realistic) to 5 (very realistic), listeners in around 500 blind tests rated WaveNet at 4.21 in English and 4.08 in Mandarin. While even human speech didn’t get a perfect 5, it still scored higher, at 4.55 in English and 4.21 in Mandarin. On the other hand, WaveNet outperformed the other text-to-speech methods by a wide margin.
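To put those ratings side by side, here is a small sketch that computes how far WaveNet fell short of human speech in each language. The scores are the ones quoted above; the dictionary layout and the function name `gap_to_human` are made up for illustration.

```python
# Listener ratings on the 1-to-5 scale, as quoted in the article.
scores = {
    "English": {"wavenet": 4.21, "human": 4.55},
    "Mandarin": {"wavenet": 4.08, "human": 4.21},
}


def gap_to_human(language):
    """Return how many rating points WaveNet trails human speech by."""
    s = scores[language]
    # Round to two decimals to hide floating-point noise.
    return round(s["human"] - s["wavenet"], 2)


for language in scores:
    print(language, gap_to_human(language))
```

By this measure WaveNet trails human speech by 0.34 points in English and 0.13 points in Mandarin.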
WaveNet isn’t going to be showing up in something like Google Assistant right now, because it generates audio one step at a time, which requires a comparatively huge amount of computing power. But Google has published a longer explanation of the system, complete with more samples. For mathematical models and other details, there are also two papers posted online.
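A rough sketch can show why step-by-step generation is so expensive: each new audio sample depends on the ones before it, so the loop cannot skip ahead. This is a toy illustration, not Google’s code. The function `toy_next_sample_weights` is a hypothetical stand-in for the trained neural network, and the sample rate and number of amplitude levels are assumptions chosen for the example.

```python
import random

SAMPLE_RATE = 16_000  # assumed: audio samples per second
LEVELS = 256          # assumed: number of discrete amplitude values


def toy_next_sample_weights(history):
    """Hypothetical stand-in for the trained network.

    Returns one weight per amplitude level, favoring values close to
    the most recent sample so the output changes gradually.
    """
    last = history[-1] if history else LEVELS // 2
    return [1.0 / (1 + abs(level - last)) for level in range(LEVELS)]


def generate(num_samples, seed=0):
    """Generate audio one sample at a time, feeding each back in."""
    rng = random.Random(seed)
    audio = []
    for _ in range(num_samples):
        # One full model evaluation is needed for every single sample.
        weights = toy_next_sample_weights(audio)
        audio.append(rng.choices(range(LEVELS), weights=weights, k=1)[0])
    return audio


# Ten milliseconds of audio at the assumed rate already takes 160 steps.
clip = generate(SAMPLE_RATE // 100)
print(len(clip), "samples generated")
```

At the assumed rate, a single second of audio would take 16,000 passes through the loop, each one waiting on the last. With a large neural network in place of the toy function, that adds up quickly.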