Microsoft VocalZoom
Siri, Cortana, Alexa - we're all more and more reliant on voice recognition for everyday interaction with our devices. The intent is great. The execution? Not so much.
There are two big problems that voice recognition software needs to overcome - recognising words with a high degree of accuracy to make search functions effective, and recognising an individual voice where there is a degree of ambient noise. Enter VocalZoom.
Microsoft set a team at their Artificial Intelligence and Research group the challenge of matching the low error rate of a human transcriber. They used a Long Short-Term Memory (LSTM) model, trained on 2,000 hours of speech data with a 30,000-word vocabulary, then tested it against human transcribers using National Institute of Standards and Technology (NIST) tests involving telephone conversations.
The results were outstanding. On the first test, in which strangers discuss a pre-arranged topic, human transcribers returned a 5.9% error rate - as did the voice recognition system. On the second test, in which family members have open-ended conversations, the Microsoft model did slightly better, scoring an 11.1% error rate against the human transcribers' 11.3%. For the first time, voice recognition software achieved parity with humans.
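To make those percentages concrete, here is a minimal sketch of how a word error rate is commonly computed: the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The function name and the sample sentences are illustrative assumptions, and the official NIST scoring applies text normalisation rules that this sketch leaves out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

On this measure, a 5.9% error rate means roughly six word-level mistakes for every hundred words in the reference transcript.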
But what about ambient sound? VocalZoom uses information it gathers via an optical laser to analyse movements of the mouth and face - just as humans do when they're struggling to hear in a noisy environment. It combines this information with the audio of voice and ambient noise, and uses interferometry techniques to filter out the ambient sound unrelated to speech.
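The general idea of letting a second sensor decide which audio to keep can be sketched very simply. The example below is not VocalZoom's method, which the description above does not spell out; it assumes, purely for illustration, that the optical sensor yields one speech-activity value per audio frame, and it attenuates any frame where that value falls below a threshold. The function name, parameters and default values are all hypothetical.

```python
def gate_audio(audio_frames, sensor_activity, threshold=0.5, attenuation=0.1):
    """Attenuate audio frames in which the sensor reports no speech.

    audio_frames    -- list of frames, each a list of float samples
    sensor_activity -- one activity value per frame (assumed sensor output)
    threshold       -- activity level at or above which a frame counts as speech
    attenuation     -- gain applied to frames judged to be ambient noise
    """
    if len(audio_frames) != len(sensor_activity):
        raise ValueError("need exactly one sensor value per audio frame")

    gated = []
    for frame, activity in zip(audio_frames, sensor_activity):
        # Keep speech frames at full level, turn everything else down.
        gain = 1.0 if activity >= threshold else attenuation
        gated.append([sample * gain for sample in frame])
    return gated


if __name__ == "__main__":
    frames = [[0.2, -0.3], [0.4, 0.4], [0.1, -0.1]]
    activity = [0.9, 0.1, 0.7]  # the second frame has no facial movement
    print(gate_audio(frames, activity))
```

A real system would work on much finer time and frequency resolution than whole frames, but the principle is the same: the sensor signal, which ambient noise does not affect, tells the filter which parts of the audio belong to the speaker.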
The VocalZoom laser is designed to be safe for the eyes, and Microsoft are already working on a range of applications, including wearables, in which the laser focuses at distances of 5 cm to one metre.
So is this combination of laser signals, emitted at 1-2 kHz so that they don't interfere with the auditory range of 4-6 kHz, and neural network-based voice recognition the long-awaited solution for high-quality voice activation? Very possibly. It tackles both of the problems described above: error rates and ambient sound. The next step will be to extract actual meaning from what is said.