π-AVAS: Can Physics-Integrated Audio-Visual Modeling Boost Neural Acoustic Synthesis?

ICCV 2025

 Susan Liang    Chao Huang    Yunlong Tang
Zeliang Zhang    Chenliang Xu

University of Rochester

Abstract

The Audio-Visual Acoustic Synthesis (AVAS) task aims to model realistic audio propagation behavior within a specific visual scene. Prior works often rely on sparse image representations to guide acoustic synthesis. However, we argue that this approach is insufficient to capture the intricate physical properties of the environment and may struggle with generalization across diverse scenes. In this work, we review the limitations of existing pipelines and address the research question: Can we leverage physical audio-visual associations to enhance neural acoustic synthesis? We introduce Physics-Integrated Audio-Visual Acoustic Synthesis (PI-AVAS or π-AVAS), a novel framework designed with two key objectives. i) Generalization: We develop a vision-guided audio simulation framework that leverages physics-based sound propagation. By explicitly modeling vision-grounded geometry and sound rays, our approach achieves robust performance across diverse visual environments. ii) Realism: While simulation-based approaches offer generalizability, they often compromise on realism. To mitigate this, we incorporate a second stage for data-centric refinement, where we propose a flow matching-based audio refinement model to narrow the gap between simulation and real-world audio-visual scenes. Extensive experiments demonstrate the effectiveness and robustness of our method. We achieve state-of-the-art performance on the RWAVS-Gen, RWAVS, and RAF datasets. Additionally, we show that our approach can be seamlessly integrated with existing methods to significantly improve their performance.
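To make the first objective concrete, the sketch below simulates sound propagation with off-the-shelf geometric acoustics (pyroomacoustics). It is a minimal stand-in, not the paper's implementation: the actual pipeline traces sound rays through a mesh reconstructed from images, whereas the shoebox room, wall absorption, and speaker/listener positions here are illustrative assumptions.

```python
# Minimal stage-1 stand-in: physics-based sound propagation with geometric acoustics.
# Assumptions (not from the paper): shoebox geometry, uniform absorption, noise source.
import numpy as np
import pyroomacoustics as pra

fs = 16000
clean = np.random.randn(fs * 2).astype(np.float32)  # placeholder clean source signal

# Stand-in geometry: a 6 m x 4 m x 3 m room with uniform wall absorption.
room = pra.ShoeBox(
    [6.0, 4.0, 3.0],
    fs=fs,
    materials=pra.Material(0.3),  # assumed energy absorption coefficient
    max_order=10,                 # image-source reflections (early part of the RIR)
)
room.set_ray_tracing()            # hybrid ray tracing for the late reverberation

room.add_source([1.0, 1.0, 1.5], signal=clean)  # speaker position (assumed)
room.add_microphone_array(                       # listener position (assumed)
    pra.MicrophoneArray(np.array([[4.5], [3.0], [1.5]]), fs)
)

room.compute_rir()
room.simulate()
simulated = room.mic_array.signals[0]  # coarse, physics-simulated audio at the listener
```

The resulting `simulated` waveform plays the role of the coarse stage-1 output that the flow matching stage would then refine.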

💡 Model Overview



Overview of our approach. Our framework π-AVAS consists of two stages: vision-guided audio simulation and flow matching-based audio refinement. In the first stage, we reconstruct the 3D scene and simulate sound propagation between a speaker and a listener within the reconstructed mesh. In the second stage, we refine the coarsely simulated sound with a flow matching model, enhancing the quality of the synthesized audio.
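For the second stage, the following is a minimal, generic conditional flow matching sketch in PyTorch. It illustrates the kind of objective and sampling loop such a refinement model relies on: a velocity field is trained to transport the coarse simulated audio toward the recorded audio and is then integrated with a few Euler steps. The tiny 1-D convolutional network, waveform shapes, and step count are illustrative assumptions, not the paper's architecture.

```python
# Generic conditional flow matching sketch (illustrative, not the paper's model).
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        # input: [x_t, simulated condition, time] stacked along the channel axis
        self.net = nn.Sequential(
            nn.Conv1d(3, channels, 5, padding=2), nn.SiLU(),
            nn.Conv1d(channels, channels, 5, padding=2), nn.SiLU(),
            nn.Conv1d(channels, 1, 5, padding=2),
        )

    def forward(self, x_t, cond, t):
        t_map = t.view(-1, 1, 1).expand_as(x_t)  # broadcast time over samples
        return self.net(torch.cat([x_t, cond, t_map], dim=1))

def flow_matching_loss(model, simulated, recorded):
    """Linear-interpolation flow matching: x_t = (1-t)*x0 + t*x1, target v = x1 - x0."""
    x0, x1 = simulated, recorded
    t = torch.rand(x0.shape[0], device=x0.device)
    x_t = (1 - t.view(-1, 1, 1)) * x0 + t.view(-1, 1, 1) * x1
    v_pred = model(x_t, simulated, t)
    return ((v_pred - (x1 - x0)) ** 2).mean()

@torch.no_grad()
def refine(model, simulated, steps=16):
    """Euler integration of dx/dt = v(x, t), starting from the simulated audio."""
    x = simulated.clone()
    for i in range(steps):
        t = torch.full((x.shape[0],), i / steps, device=x.device)
        x = x + model(x, simulated, t) / steps
    return x

# Toy usage with waveforms shaped (batch, 1, samples).
model = VelocityNet()
sim = torch.randn(2, 1, 16000)
rec = torch.randn(2, 1, 16000)
loss = flow_matching_loss(model, sim, rec)
refined = refine(model, sim)
```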

📢 Demo

Below, you will find audio synthesized by our π-AVAS model. For each example, we present a top-down view of the environment, indicating the location of the novel (evaluation) sound source and the camera pose (listener). To enhance auditory immersion, we also include the corresponding egocentric image captured from the camera's perspective. We present the clean input sound, the simulated sound generated by our physics-integrated audio simulation module, the synthesized sound produced by our two-stage model, and the ground-truth (recorded) sound. Each audio clip is accompanied by its waveform for easier visual comparison. Please use headphones 🎧 or a loudspeaker 🔈 for the best listening experience.

[Demo gallery: eleven examples, each pairing the scene image with four audio clips and their waveforms: Input Audio, Simulated Audio, Predicted Audio (Ours), and Ground-Truth Audio.]