π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?

ICCV 2025

 Susan Liang    Chao Huang    Yunlong Tang
Zeliang Zhang    Chenliang Xu

University of Rochester

Abstract

The Audio-Visual Acoustic Synthesis (AVAS) task aims to model realistic audio propagation behavior within a specific visual scene. Prior works often rely on sparse image representations to guide acoustic synthesis. However, we argue that this approach is insufficient to capture the intricate physical properties of the environment and may struggle with generalization across diverse scenes. In this work, we review the limitations of existing pipelines and address the research question: Can we leverage physical audio-visual associations to enhance neural acoustic synthesis? We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or π-AVAS), a novel framework designed with two key objectives. i) Generalization: We develop a vision-guided audio simulation framework that leverages physics-based sound propagation. By explicitly modeling vision-grounded geometry and sound rays, our approach achieves robust performance across diverse visual environments. ii) Realism: While simulation-based approaches offer generalizability, they often compromise on realism. To mitigate this, we incorporate a second stage for data-centric refinement, where we propose a flow matching-based audio refinement model to narrow the gap between simulation and real-world audio-visual scenes. Extensive experiments demonstrate the effectiveness and robustness of our method. We achieve state-of-the-art performance on the RWAVS-Gen, RWAVS, and RAF datasets. Additionally, we show that our approach can be seamlessly integrated with existing methods to significantly improve their performance.
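To make the first objective concrete, the sketch below simulates sound propagation with off-the-shelf geometric acoustics (pyroomacoustics). It is a minimal stand-in, not the paper's implementation: the actual pipeline traces sound rays through a mesh reconstructed from images, whereas the shoebox room, wall absorption, and speaker/listener positions here are illustrative assumptions.

```python
# Minimal stage-1 stand-in: physics-based sound propagation with geometric acoustics.
# Assumptions (not from the paper): shoebox geometry, uniform absorption, noise source.
import numpy as np
import pyroomacoustics as pra

fs = 16000
clean = np.random.randn(fs * 2).astype(np.float32)  # placeholder clean source signal

# Stand-in geometry: a 6 m x 4 m x 3 m room with uniform wall absorption.
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],
    fs=fs,
    materials=pra.Material(0.3),  # assumed energy absorption coefficient
    max_order=10,                 # image-source reflections (early part of the RIR)
)
room.set_ray_tracing()            # hybrid ray tracing for the late reverberation

room.add_source([1.0, 1.0, 1.5], signal=clean)  # speaker position (assumed)
room.add_microphone_array(                       # listener position (assumed)
    pra.MicrophoneArray(np.array([[4.5], [3.0], [1.5]]), fs)
)

room.compute_rir()
room.simulate()
simulated = room.mic_array.signals[0]  # coarse, physics-simulated audio at the listener
```

The resulting `simulated` waveform plays the role of the coarse stage-1 output that the flow matching stage would then refine.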

💡 Model Overview



Overview of our approach. Our framework π-AVAS consists of two stages: vision-guided audio simulation and flow matching-based audio refinement. In the first stage, we reconstruct the 3D scene and simulate sound propagation between a speaker and a listener within the reconstructed mesh. In the second stage, we refine the coarsely simulated sound with a flow matching model, enhancing the quality of the synthesized audio.
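For the second stage, the following is a minimal, generic conditional flow matching sketch in PyTorch. It illustrates the kind of objective and sampling loop such a refinement model relies on: a velocity field is trained to transport the coarse simulated audio toward the recorded audio and is then integrated with a few Euler steps. The tiny 1-D convolutional network, waveform shapes, and step count are illustrative assumptions, not the paper's architecture.

```python
# Generic conditional flow matching sketch (illustrative, not the paper's model).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # input: [x_t, simulated condition, time] stacked along the channel axis
        self.net = nn.Sequential(
            nn.Conv1d(3, channels, 5, padding=2), nn.SiLU(),
            nn.Conv1d(channels, channels, 5, padding=2), nn.SiLU(),
            nn.Conv1d(channels, 1, 5, padding=2),
        )

    def forward(self, x_t, cond, t):
        t_map = t.view(-1, 1, 1).expand_as(x_t)  # broadcast time over samples
        return self.net(torch.cat([x_t, cond, t_map], dim=1))

def flow_matching_loss(model, simulated, recorded):
    """Linear-interpolation flow matching: x_t = (1-t)*x0 + t*x1, target v = x1 - x0."""
    x0, x1 = simulated, recorded
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * x1
    v_pred = model(x_t, simulated, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def refine(model, simulated, steps=16):
    """Euler integration of dx/dt = v(x, t), starting from the simulated audio."""
    x = simulated.clone()
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps, device=x.device)
        x = x + model(x, simulated, t) / steps
    return x

# Toy usage with waveforms shaped (batch, 1, samples).
model = VelocityNet()
sim = torch.randn(2, 1, 16000)
rec = torch.randn(2, 1, 16000)
loss = flow_matching_loss(model, sim, rec)
refined = refine(model, sim)
```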

📢 Demo

Below, you will find audio synthesized by our π-AVAS model. For each example, we present a top-down view of the environment, indicating the location of the novel (evaluation) sound source and the camera pose (listener). To enhance auditory immersion, we also include the corresponding egocentric image captured from the camera's perspective. We present the clean input sound, the simulated sound generated by our physics-integrated audio simulation module, the synthesized sound produced by our two-stage model, and the ground-truth (recorded) sound. Each audio clip is accompanied by its waveform for easier visual comparison. Please use headphones 🎧 or a loudspeaker 🔈 for the best listening experience.

[Demo gallery: eleven examples, each pairing the scene image with four audio clips and their waveforms: Input Audio, Simulated Audio, Predicted Audio (Ours), and Ground-Truth Audio.]