AV-NeRF: Learning Neural Fields for Real-World
Audio-Visual Scene Synthesis

NeurIPS 2023

Susan Liang1    Chao Huang1    Yapeng Tian1
Anurag Kumar2    Chenliang Xu1

1University of Rochester    2Meta Reality Labs Research

[Paper (arxiv)] [Code/Dataset]

Abstract

Can machines recording an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer this question by studying a new task---real-world audio-visual scene synthesis---and a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, in which we implicitly associate audio generation with the 3D geometry and material properties of a visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and the simulation-based SoundSpaces dataset.

Task Definition

The real-world audio-visual scene synthesis task is to generate visual frames and the corresponding binaural audio for arbitrary camera trajectories. In this task, a static environment is given along with multiple observations of that environment. Each observation comprises the camera pose (p=(x,y,z,θ,φ)), the mono source audio clip (a_s), the recorded binaural audio (a_t), and the image (I). The objective is to synthesize a new binaural audio clip and a novel view from a query camera pose and a source audio clip. It is important to note that the query camera pose is distinct from the camera poses in all observations.
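As a rough illustration, each observation and each synthesis query could be represented as in the minimal Python sketch below. The field names, the array shapes, and the render_view/render_audio calls are hypothetical and are not the dataset's or the model's actual interface.

# Minimal sketch of the task's inputs and outputs (hypothetical schema).
from dataclasses import dataclass
import numpy as np


@dataclass
class Observation:
    pose: np.ndarray            # camera pose p = (x, y, z, theta, phi), shape (5,)
    source_audio: np.ndarray    # mono source clip a_s, shape (T,)
    binaural_audio: np.ndarray  # recorded binaural audio a_t, shape (2, T)
    image: np.ndarray           # recorded frame I, shape (H, W, 3)


@dataclass
class Query:
    pose: np.ndarray            # novel camera pose, distinct from all observed poses
    source_audio: np.ndarray    # mono source clip to be spatialized


def synthesize(model, query: Query):
    """Return the novel view and the corresponding binaural audio (hypothetical API)."""
    image = model.render_view(query.pose)
    binaural = model.render_audio(query.pose, query.source_audio)
    return image, binaural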

To intuitively understand this task, consider the following video example: Given audio-visual observations captured from three different poses (1, 2, and 3), the task is to synthesize the visual image and the binaural audio for a new pose (4).


Model Overview


The pipeline of our method. Given the position (x,y,z) and viewing direction (θ,φ) of a listener, our method renders the image the listener would see and the corresponding binaural audio the listener would hear. Our model consists of V-NeRF, A-NeRF, and AV-Mapper. V-NeRF learns to generate visual frames, A-NeRF learns to generate acoustic masks, and AV-Mapper is optimized to integrate geometry and material information extracted from V-NeRF into A-NeRF.
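As a rough sketch of how these three modules could be composed at inference time, consider the following Python. The module interfaces, the STFT-mask formulation, and the relative-angle computation are illustrative assumptions, not the exact implementation.

import torch

def render(v_nerf, a_nerf, av_mapper, pose, source_audio, source_pos):
    # pose: tensor (x, y, z, theta, phi); source_audio: mono waveform tensor (T,);
    # source_pos: tensor (x, y) of the sound source. All interfaces are assumed.
    x, y, z, theta, phi = pose

    # 1) V-NeRF renders the frame seen from this pose and exposes
    #    geometry/material features gathered along the rays.
    image, geometry_feat = v_nerf(pose)

    # 2) AV-Mapper projects those features into A-NeRF's conditioning space.
    env_feat = av_mapper(geometry_feat)

    # 3) Coordinate transformation: express the viewing direction relative to
    #    the sound source so the learned acoustic field is source-centric.
    rel_angle = torch.atan2(source_pos[1] - y, source_pos[0] - x) - theta

    # 4) A-NeRF predicts acoustic masks that reshape the source spectrogram
    #    into the left and right channels.
    spec = torch.stft(source_audio, n_fft=512, return_complex=True)
    mask_left, mask_right = a_nerf(torch.stack([x, y, z]), rel_angle, env_feat)
    left = torch.istft(spec * mask_left, n_fft=512)
    right = torch.istft(spec * mask_right, n_fft=512)

    return image, torch.stack([left, right])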


Demo Videos

Below, you will find videos showcasing our results on the RWAVS dataset. Each video includes the camera trajectory, the sound source, the audio levels of the two channels, the rendered visual frames, and the rendered binaural audio. The camera trajectory is displayed in the top left corner, with blue triangles representing camera poses of training samples and red triangles representing novel poses. The sound source is shown both in the camera trajectory and in the rendered visual frames. To provide an intuitive understanding of the energy of the synthesized binaural audio, we visualize the energy levels of both the left and right channels in the bottom left corner. Our model exhibits three key characteristics, which are prominently highlighted in these videos.
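For reference, the per-channel energy curves shown in the bottom left corner could be computed with a simple sliding-window RMS, as in the sketch below; the window and hop sizes are arbitrary choices for illustration, not the values used in the videos.

import numpy as np

def channel_energy(binaural: np.ndarray, win: int = 1024, hop: int = 512):
    # binaural: array of shape (2, T). Returns one RMS curve per channel.
    curves = []
    for channel in binaural:
        frames = [channel[i:i + win] for i in range(0, len(channel) - win + 1, hop)]
        curves.append([float(np.sqrt(np.mean(f ** 2))) for f in frames])
    return curves  # [left_rms_curve, right_rms_curve]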

Please wear headphones when watching videos.

RWAVS Dataset

Below, you will find videos displaying training samples from the RWAVS dataset, which encompasses diverse environments and a wide range of camera poses. We recorded videos in both indoor and outdoor environments, which we believe cover most everyday settings. During data recording, we randomly moved around each environment while holding the recording device, allowing us to capture varied acoustic and visual signals corresponding to different camera poses.