In a new post published Monday on its Machine Learning Journal blog, Apple details how HomePod, its wireless smart speaker, uses machine learning to improve far-field accuracy, helping Siri suppress background sounds and better understand your spoken requests in noisy environments.
From the article:
The typical audio environment for HomePod has many challenges—echo, reverberation and noise. Unlike Siri on iPhone, which operates close to the user’s mouth, Siri on HomePod must work well in a far-field setting. Users want to invoke Siri from many locations, like the couch or the kitchen, without regard to where HomePod sits.
A complete online system, which addresses all of the environmental issues that HomePod can experience, requires a tight integration of various multichannel signal processing technologies. Accordingly, the Audio Software Engineering and Siri Speech teams built a system that integrates both supervised deep learning models and unsupervised online learning algorithms and that leverages multiple microphone signals.
The system selects the optimal audio stream for the speech recognizer by using top-down knowledge from ‘Hey Siri’ trigger phrase detectors.
The rest of the article discusses the various machine learning techniques used for online signal processing, as well as the challenges Apple faced and its solutions for achieving environmental and algorithmic robustness while staying energy efficient.
Long story short, Siri on HomePod implements a Multichannel Echo Cancellation (MCEC) algorithm, which uses a set of linear adaptive filters to model the acoustic paths between the loudspeakers and the microphones and cancel the acoustic coupling.
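To make the idea concrete, here is a minimal single-channel sketch of that adaptive linear filtering, using a normalized LMS (NLMS) update. Apple's post does not specify the adaptation algorithm or filter lengths, and the real MCEC handles multiple loudspeaker and microphone channels, so everything below (the NLMS choice, the filter length, the toy signals) is an illustrative assumption.

```python
import numpy as np

def nlms_echo_cancel(mic, ref, filter_len=64, mu=0.5, eps=1e-8):
    """Cancel the echo of `ref` (loudspeaker playback) from `mic`
    with a normalized LMS adaptive filter -- a single-channel sketch
    of the linear filtering an echo canceller performs per
    loudspeaker/microphone pair."""
    w = np.zeros(filter_len)              # adaptive taps: the echo path model
    out = np.zeros(len(mic))
    for n in range(filter_len, len(mic)):
        x = ref[n - filter_len:n][::-1]   # most recent reference samples
        echo_est = w @ x                  # estimated echo at the microphone
        e = mic[n] - echo_est             # error = mic minus modeled echo
        w += mu * e * x / (x @ x + eps)   # NLMS tap update
        out[n] = e
    return out

# Toy demo: the "echo" is a delayed, attenuated copy of the playback,
# and the far-field "speech" is a much weaker signal buried under it.
rng = np.random.default_rng(0)
ref = rng.standard_normal(8000)                       # playback signal
echo = 0.6 * np.concatenate([np.zeros(10), ref[:-10]])
speech = 0.05 * rng.standard_normal(8000)             # weak far-field speech
mic = echo + speech
cleaned = nlms_echo_cancel(mic, ref)
# Once the filter converges, the residual energy drops toward the
# speech level, far below the raw echo energy.
```

The key point the toy shows: the filter only learns the *linear* path from speaker to microphone, which is exactly why a residual suppressor is still needed downstream.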
Due to the close proximity of the loudspeakers to the microphones on HomePod, the playback signal can be significantly louder than a user’s voice command at the microphone positions, especially when the user moves away from the device. In fact, the echo signals may be 30-40 dB louder than the far-field speech signals, resulting in the trigger phrase being undetectable on the microphones during loud music playback.
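For a sense of scale, decibels map to ratios exponentially, so a 30-40 dB gap is enormous. A quick conversion (standard dB arithmetic, not from Apple's post):

```python
def db_to_power_ratio(db):
    """Convert a decibel difference to a power ratio: 10^(dB/10)."""
    return 10 ** (db / 10)

def db_to_amplitude_ratio(db):
    """Convert a decibel difference to an amplitude ratio: 10^(dB/20)."""
    return 10 ** (db / 20)

print(db_to_power_ratio(30))       # 1000.0 -> echo carries ~1000x the power
print(db_to_amplitude_ratio(40))   # 100.0  -> waveform is ~100x larger
```

In other words, at 30 dB the echo carries about a thousand times the power of the far-field speech, which is why the trigger phrase can vanish entirely during loud playback.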
TLDR: MCEC alone cannot completely remove the playback signal from your voice command.
Figure: a Siri command recorded in the presence of loud playback music: the microphone signal (top), the output of MCEC (middle) and the signal enhanced by Apple's mask-based echo suppression (bottom)
To remove the playback content remaining after the MCEC, HomePod uses a residual echo suppressor (RES) driven by a trained machine learning model. To keep trigger phrase detection reliable, the RES mitigates residual linear echo, especially in the presence of double-talk and echo path changes.
Be sure to read the full post and scroll down to Section 7, which pairs images of colorful waveforms with audio links that let you hear for yourself how much of a user's request gets drowned out by loud music playback and by the playback signal generated by HomePod's tweeters and woofer.
Tidbit: Apple’s multichannel signal processing runs on one core of the 1.4GHz dual-core A8 silicon and consumes up to 15 percent of the chip’s single-core performance.
HomePod uses machine learning for a lot of things, not just Siri.
Content recommendation algorithms that run on the device benefit from machine learning, as do HomePod’s digital audio processing and sound optimization techniques.