BinauralFlow

A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

Susan Liang¹, Dejan Markovic², Israel D. Gebru², Steven Krenn², Todd Keebler², Jacob Sandakly², Frank Yu², Samuel Hassel², Chenliang Xu¹, Alexander Richard²

¹University of Rochester, ²Codec Avatars Lab, Meta

Abstract

Binaural rendering aims to synthesize binaural audio that mimics natural hearing from a mono audio signal and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow-matching-based streaming binaural speech synthesis framework called BinauralFlow. We treat binaural rendering as a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, to tailor generative models for streaming inference, we design a causal U-Net architecture that estimates the current audio frame solely from past information. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over state-of-the-art approaches. A perceptual study further reveals that audio rendered by our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.
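To make the causality requirement concrete, here is a minimal NumPy sketch of a single causal 1-D convolution processed chunk by chunk. It is only an illustration, not the BinauralFlow U-Net: the function `causal_conv1d`, the kernel values, and the chunk size are hypothetical, and the carried-over `history` array is just a simple stand-in for the idea of keeping past context between chunks.

```python
import numpy as np

def causal_conv1d(x, kernel, history=None):
    """Causal convolution: y[t] = sum_k kernel[k] * x[t - k].

    `history` holds the last len(kernel) - 1 input samples of the previous
    chunk (zeros at stream start), so output never depends on future input.
    """
    k = len(kernel)
    if history is None:
        history = np.zeros(k - 1)
    padded = np.concatenate([history, x])
    flipped = kernel[::-1]
    y = np.array([np.dot(padded[t:t + k], flipped) for t in range(len(x))])
    # Keep the most recent k - 1 samples as context for the next chunk.
    new_history = padded[len(padded) - (k - 1):]
    return y, new_history

rng = np.random.default_rng(0)
signal = rng.standard_normal(32)
kernel = np.array([0.5, 0.3, 0.2])  # hypothetical filter taps

# Offline: process the whole signal at once.
offline, _ = causal_conv1d(signal, kernel)

# Streaming: process four chunks of 8 samples, carrying the history forward.
history = None
pieces = []
for chunk in np.split(signal, 4):
    out, history = causal_conv1d(chunk, kernel, history)
    pieces.append(out)
streamed = np.concatenate(pieces)

# Causality check: changing samples from index 20 onward leaves earlier output intact.
modified = signal.copy()
modified[20:] += 1.0
offline_modified, _ = causal_conv1d(modified, kernel)
```

In this sketch the streamed output matches the offline output, which is the property that lets a causal model run on incoming chunks without audible seams at chunk boundaries.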

Continuous inference pipeline. Starting with a mono audio chunk (top left, black solid-line box), we compute its spectrogram via streaming STFT, add noise, and duplicate the channel to form the noisy spectrogram φ₀(z). The trained model progressively removes the noise with a buffer bank. Finally, streaming ISTFT converts the predicted binaural spectrogram φ₁(z) into binaural audio. When the next audio chunk appears (black dashed-line box), we repeat the process and synthesize seamlessly continuous binaural speech.
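The denoising step of this pipeline can be illustrated with a generic midpoint ODE solver. The sketch below is not the authors' implementation: `midpoint_solve`, `toy_velocity`, the array shapes, and the noise level are hypothetical, the trained model is replaced by a closed-form toy velocity field that points toward a known target, and the streaming STFT/ISTFT, buffer bank, and early skip schedule are left out.

```python
import numpy as np

def midpoint_solve(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with the midpoint rule."""
    x = np.asarray(x0, dtype=np.float64)
    h = 1.0 / num_steps
    for i in range(num_steps):
        t = i * h
        # Half step to estimate the state at the middle of the interval.
        x_mid = x + 0.5 * h * velocity(x, t)
        # Full step using the velocity evaluated at the midpoint.
        x = x + h * velocity(x_mid, t + 0.5 * h)
    return x

rng = np.random.default_rng(0)

# Hypothetical "binaural spectrogram": (channels, frequency bins, frames).
target = rng.standard_normal((2, 4, 8))

# Mimic the caption: take a mono signal, duplicate the channel, add noise.
mono = target.mean(axis=0)
noisy_start = np.stack([mono, mono]) + 0.1 * rng.standard_normal(target.shape)

def toy_velocity(x, t):
    # Stand-in for the trained network: velocity of the straight-line path
    # from the current state to the known target (assumption, for illustration).
    return (target - x) / (1.0 - t)

rendered = midpoint_solve(toy_velocity, noisy_start, num_steps=6)
```

Because the midpoint rule evaluates the velocity at t + h/2, the toy field is never evaluated at t = 1, where it would divide by zero. With a learned velocity field the result would only approximate the target, and the number of steps trades quality against speed.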

Demo Videos

We use a flip test to compare the synthesized audio with the ground truth: every 5 seconds, playback switches between the synthesized speech and the ground-truth recording. In each video, we show a top-down view of the room along with the poses of the speaker and the listener. The speaker is denoted as "Tx" and the speaker's trajectory is shown in blue. The listener is denoted as "Rx" and the listener's trajectory is shown in red.
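A flip-test track of this kind could be assembled as in the following sketch. It is an assumption about the procedure, not the script used for the videos: the function `flip_mix`, the `(samples, channels)` layout, and the choice to start with the synthesized audio are all hypothetical.

```python
import numpy as np

def flip_mix(synthesized, ground_truth, sample_rate, period_s=5.0):
    """Alternate between two time-aligned signals every `period_s` seconds.

    Both inputs have shape (samples, channels). Even-numbered segments come
    from `synthesized`, odd-numbered ones from `ground_truth` (assumption).
    """
    n = min(len(synthesized), len(ground_truth))
    segment_len = int(round(period_s * sample_rate))
    segment_index = np.arange(n) // segment_len
    use_synth = (segment_index % 2 == 0)[:, None]
    return np.where(use_synth, synthesized[:n], ground_truth[:n])

# Tiny example: 4 samples per second, so each 5-second segment has 20 samples.
sample_rate = 4
synth = np.zeros((50, 2))  # placeholder for synthesized binaural audio
truth = np.ones((50, 2))   # placeholder for the ground-truth recording
mixed = flip_mix(synth, truth, sample_rate)
```

Here samples 0 to 19 come from the synthesized signal, samples 20 to 39 from the ground truth, and the remaining samples from the synthesized signal again.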

Sample 1

Sample 2

Sample 3

Comparison with Baselines

We compare our method with three baselines: Digital Signal Processing (DSP), BinauralGrad, and SGMSE. We also include the mono audio and the ground-truth sound for reference. In each video, we show a top-down view of the room along with the poses of the speaker and the listener. The speaker is denoted as "Tx" and the speaker's trajectory is shown in blue. The listener is denoted as "Rx" and the listener's trajectory is shown in red.

Sample 1

Mono

Digital Signal Processing (DSP)

BinauralGrad

SGMSE

Ours

Sample 2

Mono

Digital Signal Processing (DSP)

BinauralGrad

SGMSE

Ours

Sample 3

Mono

Digital Signal Processing (DSP)

BinauralGrad

SGMSE

Ours

Citation

@article{binauralflow2025,
  title={BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models},
  author={Liang, Susan and Markovic, Dejan and Gebru, Israel D. and Krenn, Steven and Keebler, Todd and Sandakly, Jacob and Yu, Frank and Hassel, Samuel and Xu, Chenliang and Richard, Alexander},
  journal={International Conference on Machine Learning},
  year={2025}
}