When there aren’t blenders blending, dogs barking, or TVs blaring in the background, automatic speech recognition (ASR) technologies like Apple’s Siri and Amazon’s Alexa are amazing. But consumers and companies are pushing use cases into increasingly challenging situations with more background noise, and recognition accuracy suffers. With companies increasingly turning to voice-enabled products—personal assistants, home hubs, smart TVs, hands-free automotive—the industry needs solutions that improve ASR accuracy.
I talked with John Walker, CEO of Cypher Corp., about the increased need for managing background noise and improving advanced speech-recognition systems.
Wong: What are the challenges associated with eliminating background noise?
Walker: In the case of noise control, much of today’s technology traces back to patents filed as early as the 1930s. Historically, devices have required custom hardware to mask the noise, which means early-stage integration into product design that delays schedules, limits flexibility, and raises costs. Worse still, these methods are typically effective only on constant noise sources coming from fixed locations.
When most noise-cancellation technology was invented, devices weren’t nearly as portable, ubiquitous, or capable as they are today. Bluetooth only recently reached a level where it was considered good enough to replace wires for carrying quality audio. Processors only recently became fast enough and affordable enough to appear in virtually any consumer electronics product. Finally, wireless networking speed and coverage only recently reached a point where cloud services can accommodate an ASR vocabulary library of 200 trillion+ possible word combinations and allow for real-time learning with contextual clues.
Wong: So what’s the solution?
Walker: As we continued to push voice as a control path into devices in uncontrolled noise environments, it became clear that we had reached the limits of a hardware-based, noise-centric approach. Building chips that tried to identify noise types and filter them out wasn’t working; we learned early on that we wouldn’t be able to isolate and block every random sound. That led us to develop a machine-learning approach that flips the problem around: a deep neural network isolates the sounds of the human voice and passes only those through. By doing so, we block virtually all background noise.
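Cypher hasn’t published its architecture, but the general technique Walker is describing (often called time-frequency masking) can be sketched briefly: split the audio into short spectral frames, have a network score each frequency bin as voice or not, keep only the voice-weighted energy, and resynthesize. The Python below is a minimal, hypothetical illustration of that idea; the frame sizes, the tiny stand-in network, and its untrained random weights are all assumptions, not Cypher’s implementation.

```python
import numpy as np

# Hypothetical sketch of voice isolation via time-frequency masking.
# Cypher's actual architecture is unpublished; the frame sizes, the tiny
# stand-in network, and its random weights are illustrative only.

FRAME = 512   # samples per analysis frame
HOP = 256     # hop between frames (50% overlap)

def stft(x):
    """Split audio into windowed frames and return magnitude and phase."""
    window = np.hanning(FRAME)
    frames = [x[i:i + FRAME] * window
              for i in range(0, len(x) - FRAME, HOP)]
    spec = np.fft.rfft(np.array(frames), axis=1)
    return np.abs(spec), np.angle(spec)

def istft(mag, phase, length):
    """Resynthesize audio from magnitude and phase by overlap-add."""
    spec = mag * np.exp(1j * phase)
    out = np.zeros(length)
    for i, frame in enumerate(np.fft.irfft(spec, n=FRAME, axis=1)):
        out[i * HOP:i * HOP + FRAME] += frame
    return out

class MaskNet:
    """Tiny stand-in for a trained speech-vs-noise mask estimator."""

    def __init__(self, bins):
        rng = np.random.default_rng(0)
        self.w1 = rng.normal(scale=0.1, size=(bins, 64))
        self.w2 = rng.normal(scale=0.1, size=(64, bins))

    def mask(self, mag):
        h = np.maximum(mag @ self.w1, 0.0)           # ReLU hidden layer
        return 1.0 / (1.0 + np.exp(-(h @ self.w2)))  # sigmoid: 0..1 per bin

def suppress_noise(noisy, net):
    """Keep the time-frequency bins the network scores as voice."""
    mag, phase = stft(noisy)
    speech_mask = net.mask(mag)   # ~1 for voice-like bins, ~0 for noise
    return istft(mag * speech_mask, phase, len(noisy))

audio = np.random.randn(16000)    # 1 s of stand-in noisy audio at 16 kHz
net = MaskNet(bins=FRAME // 2 + 1)
clean = suppress_noise(audio, net)
print(clean.shape)                # (16000,)
```

The detail that matches Walker’s description is the inversion: the network models what voice looks like, so anything it doesn’t score as voice is attenuated, regardless of the noise type.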
Wong: What is Cypher doing in the space?
Walker: Cypher was founded on the idea that while we increasingly use voice-centric devices in more places and rely on them to do more things, they haven’t undergone a significant improvement in noise-immunity quality. Our team set out to solve this problem in a novel way: treating it as a math problem rather than an audio problem. Three years later, we have a machine-learning-based software solution that provides over three times the noise suppression of the leading smartphone and improves speech-recognition performance in noisy conditions by over 120%.
Cypher’s technology is software-only. It is capable of running on the types of processors commonly found in smartphones and consumer electronics devices, and can identify the physical attributes of the human voice, regardless of age, gender, or language.
Our approach does not rely on the acoustical design of hardware or on network bandwidth upgrades. Instead, Cypher’s software fits onto the digital signal processors (DSPs) already found in even simple consumer electronics devices. We use advanced math, neural networks, and pattern matching to detect the unique qualities of the human voice and ignore everything else, making a user’s voice crystal clear to the listener on the other end of the line—even in the noisiest environments. This is ideal for mobile phones, but it has applications for any product that relies on voice.
Our process also improves ASR accuracy. We just completed a test of the latest version of our ASR filtering on Amazon Echo’s Alexa, using over 1,000 different open-ended queries to measure not only its ability to detect a single word, but its ability to recognize the entire command accurately enough to produce a correct response. Cypher improved Alexa’s ASR accuracy by as much as 121% in a noisy environment.
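For concreteness, here is a minimal, hypothetical sketch of that kind of command-level scoring: a query counts as correct only if the entire transcript matches, and the improvement figure is relative to the noisy baseline. The helper names and the toy numbers below are illustrative; Cypher’s actual test harness and data are unpublished.

```python
# Hypothetical sketch of command-level accuracy scoring in the spirit of
# the Alexa test described above. Cypher's real harness is unpublished;
# helper names and the sample numbers here are illustrative only.

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so comparisons are fair."""
    return " ".join(text.lower().split())

def command_accuracy(recognized: list[str], expected: list[str]) -> float:
    """Fraction of queries where the entire command matched, not just a word."""
    hits = sum(normalize(r) == normalize(e)
               for r, e in zip(recognized, expected))
    return hits / len(expected)

def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage gain over the baseline, e.g. 0.30 -> 0.66 is +120%."""
    return 100.0 * (improved - baseline) / baseline

# A partial transcript counts as a miss: the whole command must come through.
expected   = ["turn on the kitchen lights", "play jazz in the den"]
recognized = ["turn on the kitchen lights", "play jazz"]
print(command_accuracy(recognized, expected))           # -> 0.5

# Toy numbers: if a noisy baseline got 30% of queries exactly right and
# the filtered pipeline got 66%, that is a 120% relative improvement.
print(round(relative_improvement(0.30, 0.66), 1))       # -> 120.0
```

Note that a relative improvement above 100% means the filtered pipeline got more than double the baseline’s queries exactly right.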
Wong: Don’t neural networks require substantial amounts of space and processing power within the device itself?
Walker: Many developers think you need expensive hardware or a lot of on-device memory to use a neural network for background-noise cancellation. That is one of the biggest misconceptions about neural networks: at run time, they don’t have to consume much space or processing power.
In Cypher’s case, we take a very large database of human speech and train our deep-learning neural network on servers in advance. Once trained—a process that can take hours or even days—the resulting Cypher algorithm is capable of identifying the qualities of human voice versus other noise types. Because all of this training takes place offline, by the time a developer adds our software to a device, it’s pre-trained to start working immediately. It has a small-enough footprint to provide real-time speech enhancement on virtually any type of product, such as phones, fitness trackers, Bluetooth headsets, etc. There is no need for a persistent network connection when the software lives on the DSP itself.
Wong: What actionable advice do you have for product manufacturers, and does it differ if you’re adding to an existing product versus building something from the ground up?
Walker: If you want to improve the background noise cancellation and/or speech-recognition accuracy in an existing product, it’s best to talk with your DSP chip manufacturer to see what options they have available. For instance, we are working with CEVA TeakLite and Qualcomm Snapdragon. If you are developing a product from the ground up, with some planning you can optimize everything from the number and placement of microphones to processing and memory requirements.
Cypher software can be deployed in three different ways: embedded in existing hardware, integrated into the OS, or as an application running on top of the OS. In all implementations, Cypher needs input from only two microphones.
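As a rough illustration of that two-microphone contract, here is a hypothetical frame loop that deinterleaves a primary and a reference channel and hands them to a placeholder suppression hook. The frame size, the interleaved layout, and the process_frame() function are assumptions for the sketch; the real integration point would depend on which of the three deployments is used.

```python
import numpy as np

# Hypothetical two-microphone processing loop. The frame size, the
# interleaving, and the crude reference-subtraction placeholder are
# illustrative only; they stand in for the actual suppression engine.

FRAME = 256  # samples per mic per processing block

def process_frame(mic_primary, mic_reference):
    """Placeholder for the engine: crude reference-channel subtraction."""
    return mic_primary - 0.5 * mic_reference

def run(stereo_stream):
    """Deinterleave two-mic blocks and feed each one to the engine."""
    out = []
    for start in range(0, len(stereo_stream) - 2 * FRAME + 1, 2 * FRAME):
        block = stereo_stream[start:start + 2 * FRAME]
        out.append(process_frame(block[0::2], block[1::2]))  # mic 1, mic 2
    return np.concatenate(out)

print(run(np.random.randn(16000)).shape)  # stand-in interleaved capture
```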