Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrence of audio and video signals provides humans with rich cues. This paper focuses on novel audio-visual scene synthesis in the real world. Given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. Directly using a NeRF-based model for audio synthesis is insufficient because it lacks prior knowledge and acoustic supervision. To tackle these challenges, we first propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, associating audio generation with the 3D geometry of the visual environment. In addition, we propose a coordinate transformation module that expresses a viewing direction relative to the sound source; this direction transformation helps the model learn sound-source-centric acoustic fields. Moreover, we utilize a head-related impulse response function to synthesize pseudo-binaural audio for data augmentation, which strengthens training. We qualitatively and quantitatively demonstrate the advantage of our model on real-world audio-visual scenes, and we refer interested readers to our video results for convincing comparisons.
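As a rough illustration of the coordinate transformation mentioned above, the sketch below re-expresses a listener's viewing direction relative to the sound source position. It assumes the source position is known, and all function and variable names are illustrative rather than the paper's actual interface.

```python
import numpy as np

def source_centric_direction(listener_pos, listener_dir, source_pos):
    """Express the listener's viewing direction relative to the sound source.

    listener_pos, source_pos: (3,) world-space positions.
    listener_dir: (3,) unit vector pointing where the listener looks.
    Returns a signed azimuth between the viewing direction and the
    listener-to-source direction, plus the listener-source distance.
    Names and conventions are illustrative, not the paper's API.
    """
    to_source = source_pos - listener_pos
    distance = np.linalg.norm(to_source)
    to_source = to_source / (distance + 1e-8)

    # Unsigned angle between the viewing direction and the source direction.
    cos_angle = np.clip(np.dot(listener_dir, to_source), -1.0, 1.0)
    angle = np.arccos(cos_angle)

    # Sign the angle by the vertical component of the cross product
    # (assuming +z is "up"), giving a left/right cue for binaural rendering.
    sign = np.sign(np.cross(listener_dir, to_source)[2])
    return sign * angle, distance
```

For instance, a listener facing the source directly gets an azimuth near zero, while a source to the listener's left or right yields a positive or negative angle, which is the kind of source-centric cue an acoustic field can exploit.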
Model overview: Given the position (x,y,z) and viewing direction (θ,φ) of a listener, our method can render an image the listener would see and the corresponding binaural audio the listener would hear. Our model consists of V-NeRF, A-NeRF, and AV-Bridge. V-NeRF learns to generate visual frames, A-NeRF learns to generate acoustic masks, and AV-Bridge is optimized to extract geometric information from V-NeRF and incorporate this information into A-NeRF.
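A minimal sketch of how these three components could be composed at inference time is given below, in PyTorch style. The module internals are placeholders, and the interfaces (what AV-Bridge passes to A-NeRF, and how the acoustic masks are applied to the input audio) are our assumptions, not the authors' implementation.

```python
import torch.nn as nn

class AVRenderer(nn.Module):
    """Illustrative composition of V-NeRF, A-NeRF, and AV-Bridge.

    The three submodules are passed in as placeholders; their internals
    and exact signatures are assumptions made for this sketch.
    """

    def __init__(self, v_nerf: nn.Module, a_nerf: nn.Module, av_bridge: nn.Module):
        super().__init__()
        self.v_nerf = v_nerf        # renders visual frames, exposes geometry features
        self.a_nerf = a_nerf        # predicts per-channel acoustic masks
        self.av_bridge = av_bridge  # maps V-NeRF geometry features into A-NeRF

    def forward(self, position, direction, mono_audio_spec):
        # V-NeRF: render the frame the listener would see and expose
        # geometric information about the surrounding scene.
        frame, geometry_feat = self.v_nerf(position, direction)

        # AV-Bridge: condition the acoustic field on the visual geometry.
        acoustic_cond = self.av_bridge(geometry_feat)

        # A-NeRF: predict left/right acoustic masks and apply them to the
        # source (mono) audio spectrogram to obtain binaural audio.
        mask_l, mask_r = self.a_nerf(position, direction, acoustic_cond)
        left = mask_l * mono_audio_spec
        right = mask_r * mono_audio_spec
        return frame, (left, right)
```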
Different from past audio-visual learning works, this paper focuses on the synthesis of novel audio-visual scenes in the real world. We define novel audio-visual scene synthesis as the task of synthesizing a target video, including visual frames and the corresponding spatial audio, along an arbitrary camera trajectory from given source videos and trajectories. Learning from source videos recorded with binaural audio in a real-world environment, the model is expected to generate target spatial audio and visual frames that are consistent with the given camera trajectory both visually and acoustically, ensuring perceptual realism and immersion.
Below you will find videos of our results on Real-World Audio-Visual Scenes. We show the videos used for training and the videos rendered by different methods, including Mono-Mono, Baseline, and Ours. In each video, we show the camera trajectory, the sound source, the audio levels of the two channels, the rendered visual frames, and the rendered binaural audio. Because this paper targets modeling acoustic fields and establishing the correlation between the visual and acoustic worlds, we use the same visual rendering results when comparing the acoustic results of different methods. Please wear headphones when watching the videos.
We first compare results in the large room.
We show the 360-degree rendering results.
We show the left-to-right rendering results.
We then compare results in the medium room.
We show the 360-degree rendering results.
We show the left-to-right rendering results.
Below you will find videos of our results on the FAIR-PLAY Dataset. We compare our method with MONO2BINAURAL and PSEUDO2BINAURAL. We show results in four scenes: Harp, Cello, Drum, and Guitar. For each scene, we show the ground-truth video and the videos synthesized by different methods. Because MONO2BINAURAL and PSEUDO2BINAURAL do not support novel view synthesis, we feed these two models the visual frames retrieved with the nearest camera pose. Please wear headphones when watching the videos.
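For reference, a nearest-camera-pose retrieval of the kind described above could look like the sketch below. It matches on camera position only; whether rotation also factors into the retrieval is an assumption of this sketch, and the names are illustrative.

```python
import numpy as np

def retrieve_nearest_frame(query_pos, train_positions, train_frames):
    """Pick the training frame whose camera position is closest to the query.

    query_pos: (3,) query camera position.
    train_positions: (N, 3) camera positions of the training frames.
    train_frames: list of N frames (e.g. HxWx3 arrays).
    """
    dists = np.linalg.norm(train_positions - query_pos[None, :], axis=1)
    return train_frames[int(np.argmin(dists))]
```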
We show the results in the Harp scene.
We show the results in the Cello scene.
We show the results in the Drum scene.
We show the results in the Guitar scene.