
Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?

Susan Liang1, Chao Huang1, Filippos Bellos2, Yolo Yunlong Tang1, Qianxiang Shen1, Jing Bi1, Luchuan Song1, Zeliang Zhang1, Jason Corso2, Chenliang Xu1
1University of Rochester, 2University of Michigan, Ann Arbor

Abstract

State-of-the-art text-to-video generation models such as Sora 2 and Veo 3 can now produce high-fidelity videos with synchronized audio directly from a textual prompt, marking a new milestone in multi-modal generation. However, evaluating such tri-modal outputs remains an unsolved challenge. Human evaluation is reliable but costly and difficult to scale, while traditional automatic metrics, such as FVD, CLAP, and ViCLIP, focus on isolated modality pairs, struggle with complex prompts, and provide limited interpretability. Omni-modal large language models (omni-LLMs) present a promising alternative: they naturally process audio, video, and text, support rich reasoning, and offer interpretable chain-of-thought feedback. Motivated by this, we introduce Omni-Judge, a study assessing whether omni-LLMs can serve as human-aligned judges for text-conditioned audio-video generation. Across nine perceptual and alignment metrics, Omni-Judge achieves correlation with human judgments comparable to traditional metrics and excels on semantically demanding tasks such as audio-text alignment, video-text alignment, and audio-video-text coherence. It underperforms on high-FPS perceptual metrics, including video quality and audio-video synchronization, due to limited temporal resolution. Omni-Judge provides interpretable explanations that expose semantic or physical inconsistencies, enabling practical downstream uses such as feedback-based refinement. Our findings highlight both the potential and current limitations of omni-LLMs as unified evaluators for multi-modal generation.
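
To make the judging setup concrete, the sketch below shows one way an omni-LLM could be prompted to score a generated audio-video sample along a single metric axis, and how its scores could be checked for human alignment with a Spearman rank correlation. It is a minimal illustration under assumptions, not the paper's pipeline: the prompt wording, the query_omni_llm placeholder, judge_sample, and the toy score lists are all hypothetical.

# Minimal sketch (Python): omni-LLM as a judge plus a human-alignment check.
# Everything below is illustrative; query_omni_llm() stands in for whatever
# omni-LLM client is available, and the score lists are toy numbers.
from scipy.stats import spearmanr

JUDGE_PROMPT = (
    "You are shown a text prompt and a generated video with its audio track.\n"
    "Rate the {metric} of the sample on a 1-5 scale, then explain your rating.\n"
    "Text prompt: {prompt}\n"
    "Answer in the form: Score: <1-5>. Reason: <one or two sentences>."
)

def query_omni_llm(text: str, video_path: str) -> str:
    # Placeholder for a call to an omni-modal LLM that accepts text plus a
    # video file (frames and audio track). Plug in a real client here.
    raise NotImplementedError

def judge_sample(prompt: str, video_path: str, metric: str) -> float:
    # Score one generated sample along one axis, e.g. "audio-text alignment".
    reply = query_omni_llm(JUDGE_PROMPT.format(metric=metric, prompt=prompt), video_path)
    # Expect a reply like "Score: 4. Reason: ..."; take the digit after "Score:".
    return float(reply.split("Score:")[1].strip()[0])

# Human-alignment check: rank correlation between judge scores and human
# ratings collected for the same set of generated samples (toy numbers).
judge_scores = [4.0, 2.5, 3.0, 5.0, 1.5]
human_scores = [4.2, 2.0, 3.5, 4.8, 1.0]
rho, p_value = spearmanr(judge_scores, human_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")

A judge of this form also returns the free-text reason alongside the score, which is what enables the interpretable, feedback-based uses described above.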

Audio-Video Generation Characteristics

We present several samples generated by Sora 2 and Veo 3. These samples illustrate three limitations of current state-of-the-art text-to-video generation models, namely color jittering, motion inconsistency, and audio-video desynchronization, as well as one characteristic behavior of audio-video generation models: ambiguous audio generation.

Color Jittering

Sora 2 shows noticeable color jittering, while Veo 3 produces sharper and more stable frames with smoother temporal consistency.

Prompt: Millet has a set of characteristics that optimize farmers' resources, requiring less water compared to corn and sorghum, while at the same time producing high-quality grains with good market acceptance.

Sora 2 (Color Jittering)

Veo 3

Motion Inconsistency

Sora 2 exhibits frame freezing, whereas Veo 3 maintains fluent motion and higher temporal stability across frames.

Prompt: Spaceship from Ancient Egypt flying over the desert.

Sora 2 (Frame Freezing)

Veo 3

Audio-Video Desynchronization

Sora 2 produces speech without matching lip motion, while Veo 3 generates keyboard typing sounds without corresponding hand movement, illustrating temporal misalignment between sound and action.

Prompt: Dad talking to his daughter about her teeth.

Sora 2 (Desynchronization)

Veo 3

Prompt: A speechless llama coding in the dark room.

Sora 2

Veo 3 (Desynchronization)

Ambiguous Audio Generation

Because many prompts do not contain explicit sound descriptions, the models must infer plausible audio to accompany the visual content. We observe three representative behaviors. (1) The model generates a voice narration: for example, Sora 2 produces a coherent narration describing the glowing book, aligning narratively with the visual scene. (2) The model introduces human dialogue: for example, Veo 3 adds a human dialogue to accompany a dog's reaction, where the speech semantically matches the implied event rather than the subject itself. (3) The model adds background music to match the visual atmosphere: for example, Veo 3 adds resonant background music to evoke a grand and profound mood, reflecting emotional tone rather than concrete visual actions. These cases show that current models tend to fill audio ambiguity with narrative, conversational, or emotional cues.

Prompt: A spectral glow emanating from the printed Bible, hinting at the mysterious ink.

Sora 2 (Narration)

Prompt: Dog is suspecting something.

Sora 2 (Dialogue)

Prompt: Stars twinkled in the sky and the moon rose.

Veo 3 (Background Music)

Citation

@article{omni_judge2026,
  title={Omni-Judge: Can Omni-LLMs Serve as Human-Aligned Judges for Text-Conditioned Audio-Video Generation?},
  author={Liang, Susan and Huang, Chao and Bellos, Filippos and Tang, Yolo Yunlong and Shen, Qianxiang and Bi, Jing and Song, Luchuan and Zhang, Zeliang and Corso, Jason and Xu, Chenliang},
  journal={arXiv preprint arXiv:2602.01623},
  year={2026}
}