San Francisco — As handy as all our voice recognition friends are, conversing with them still feels like talking to a foreign relative.
Whether it's Siri (Apple), Alexa (Amazon), Google Assistant or Cortana (Microsoft), each requires the human to speak in slow, carefully articulated phrases to increase the odds of comprehension.
But researchers at Microsoft say they’ve reached a milestone that promises a future where machines can transcribe us as well as another person can. In a paper published Monday called “Achieving Human Parity in Conversational Speech Recognition,” engineers with Microsoft Artificial Intelligence and Research announced they’d developed a speech recognition system that makes the same number of errors as professional transcriptionists, or fewer.
The team hit a word error rate (WER) of 5.9%, down from the 6.3% WER the team reported just last month. That 5.9% rate is about equal to that of people who were asked to transcribe the same conversation, and according to Microsoft it is the lowest ever recorded.
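For readers curious how a figure like 5.9% is arrived at: word error rate is the number of word substitutions, deletions and insertions needed to turn a system's transcript into the reference transcript, divided by the length of the reference. A minimal sketch, using a standard word-level edit-distance computation (not Microsoft's actual scoring pipeline):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed as word-level Levenshtein distance via dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six gives a WER of 1/6, about 16.7%
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this measure, a 5.9% WER means roughly one word in seventeen is transcribed incorrectly.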
“We’ve reached human parity,” Xuedong Huang, the company’s chief speech scientist, said in a statement. “This is an historic achievement.”
Improving voice recognition is bound to have an impact on consumers and enterprises alike.
“This will make Cortana more powerful, making a truly intelligent assistant possible,” said Harry Shum, who heads Microsoft’s AI division.
Almost every major tech company is pouring resources into machine learning and artificial intelligence, including Apple, Google and Amazon, which has had an unexpected hit with its Alexa-powered Echo.
Last month, Amazon announced its new Alexa Prize, which is aimed at college students with an interest in developing AI that is able to converse at length with humans. The winning team will receive $500,000, but Amazon will award students an additional $1 million if they successfully get their AI to speak with humans engagingly for 20 minutes.
“A socialbot that can converse coherently for 20 minutes is unprecedented and at least five times more advanced than state-of-the-art conversational AI,” Rohit Prasad, vice president and head scientist of Amazon Alexa, said in a statement when the prize was announced.
Samsung recently bought itself AI expertise with the purchase of Viv, an AI-powered voice assistant developed by Dag Kittlaus, one of the founders of Siri.
Geoffrey Zweig, who heads Microsoft’s speech and dialog group, said the team’s next goal is to ensure that voice recognition works in a broad array of real-life settings, whether at a party or amid road noise. The team also will tackle teaching a machine to distinguish among multiple voices speaking with different accents.
But ultimately, the goal is not just correctly hearing what a human has said, but truly understanding its meaning, which in turn could lead to action.
“The next frontier is to move from recognition to understanding,” Zweig said.