Room impulse response (RIR), which characterizes how sound propagates within an environment, is critical for synthesizing high-fidelity audio for a given scene. Prior work has proposed representing the RIR as a neural field conditioned on the sound emitter and receiver positions. However, these methods do not sufficiently account for the acoustic properties of the scene, leading to unsatisfactory performance. This letter proposes a novel Neural Acoustic Context Field approach, called NACF, which parameterizes an audio scene by leveraging multiple acoustic contexts, such as geometry, material properties, and spatial information. Motivated by two characteristic properties of RIRs, i.e., temporal un-smoothness and monotonic energy attenuation, we design a temporal correlation module and a multi-scale energy decay criterion. Experimental results show that NACF outperforms existing field-based methods by a notable margin.
Method overview. Panel (a), left: a top-down view of an example indoor scene. We sample points evenly along the room boundary and, at each point, extract contextual information such as an RGB image, a depth image, the acoustic coefficients of the surface, and spatial information. Panel (b), middle: the architecture of our NACF model. We first feed the multiple acoustic contexts extracted along the room boundary in (a) into a multi-modal fusion module. We then combine the fused contextual information with the time query to form a spatial-temporal query, which is the input to the implicit neural field. After the neural field generates the RIR, a temporal correlation module refines it. Finally, the multi-scale energy decay criterion supervises model training. Panel (c), right: visualization of the predicted and ground-truth RIRs together with the generation errors.
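To make the multi-scale energy decay criterion concrete, the NumPy sketch below compares Schroeder energy-decay curves of a predicted and a ground-truth RIR at several temporal resolutions. The specific scales, the average pooling, and the L1 distance here are illustrative assumptions, not necessarily the exact criterion used in the letter.

```python
import numpy as np

def energy_decay_curve(rir):
    """Schroeder backward integration: energy remaining after each
    sample, normalized and expressed in dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    energy = energy / (energy[0] + 1e-12)        # 0 dB at t = 0
    return 10.0 * np.log10(energy + 1e-12)

def multi_scale_decay_loss(pred_rir, gt_rir, scales=(1, 4, 16, 64)):
    """L1 distance between decay curves, averaged over several
    temporal resolutions (scales chosen for illustration)."""
    pred_edc = energy_decay_curve(pred_rir)
    gt_edc = energy_decay_curve(gt_rir)
    loss = 0.0
    for s in scales:
        n = (len(pred_edc) // s) * s
        p = pred_edc[:n].reshape(-1, s).mean(axis=1)  # average-pool by s
        g = gt_edc[:n].reshape(-1, s).mean(axis=1)
        loss += np.abs(p - g).mean()
    return loss / len(scales)

# Toy RIRs: white noise shaped by exponential decay envelopes.
rng = np.random.default_rng(0)
t = np.arange(4096) / 16000.0
gt = rng.standard_normal(4096) * np.exp(-t / 0.10)
pred = rng.standard_normal(4096) * np.exp(-t / 0.12)  # slightly slower decay
print(multi_scale_decay_loss(pred, gt))
```

Matching the decay curve at coarse scales constrains the overall reverberation time, while the fine scales penalize local deviations in the energy envelope.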
We present a series of demonstration videos in which our model estimates room impulse responses (RIRs) and convolves them with music and speech to synthesize realistic indoor sounds. To make the auditory experience more immersive, we provide accompanying ego-centric videos of the sound receiver and top-down maps. On the top-down map, a blue dot marks the location of the sound source, and a triangle marks the position and orientation of the sound receiver. Headphones are recommended for the best listening experience.
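The rendering step described above, convolving a predicted RIR with a dry (anechoic) source signal, can be sketched as follows. The function name and the two-channel RIR layout are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def render_binaural(dry, rir_left, rir_right):
    """Convolve a dry signal with a two-channel RIR to place it in
    the scene, then peak-normalize to avoid clipping (hypothetical
    helper, not the paper's API)."""
    left = np.convolve(dry, rir_left)[: len(dry)]
    right = np.convolve(dry, rir_right)[: len(dry)]
    out = np.stack([left, right], axis=-1)
    return out / (np.max(np.abs(out)) + 1e-12)

# Toy example: 0.5 s of noise standing in for speech, and a short
# exponentially decaying RIR used for both ears.
sr = 16000
dry = np.random.default_rng(1).standard_normal(sr // 2)
t = np.arange(2048) / sr
rir = np.random.default_rng(2).standard_normal(2048) * np.exp(-t / 0.05)
wet = render_binaural(dry, rir, rir)
print(wet.shape)  # (8000, 2)
```

For long RIRs, an FFT-based convolution (e.g. `scipy.signal.fftconvolve`) is substantially faster than direct `np.convolve`.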
Because the quality of generated RIRs is difficult to assess directly by human perception, we conduct ablation studies by convolving the RIRs rendered by our model with speech. Below we present the clean speech, speech convolved with the ground-truth RIR, and speech generated by NACF w/o M, NACF w/o C, NACF, and NACF w/ T, where NACF w/o M denotes NACF without the multi-scale energy decay criterion, NACF w/o C denotes NACF without the acoustic context module, and NACF w/ T denotes NACF with the temporal correlation module. To enhance the auditory experience, we provide accompanying ego-centric 360-degree images of the sound receiver and top-down maps. On the top-down map, a red dot marks the location of the sound source, and a triangle marks the position and orientation of the sound receiver. Headphones are recommended for the best listening experience.