MMAudio AI: Open-Source Video-to-Audio Synthesis Model

MMAudio is an open-source AI model for video-to-audio synthesis developed by researchers at Sony AI and the University of Illinois Urbana-Champaign. Accepted at CVPR 2025, it generates high-quality, synchronized audio from video inputs, text descriptions, or both. The model sets a new state of the art among public models, and its smallest variant has only 157 million parameters and produces an 8-second audio clip in approximately 1.2 seconds.

How It Works

MMAudio uses a multimodal joint training approach: a single model is trained simultaneously on audio-visual and audio-text datasets. Previous methods either trained on audio-visual data alone from scratch or added control modules to pretrained text-to-audio models. By training across multiple data types, MMAudio learns a unified semantic space that improves both audio quality and alignment with visual content. A key innovation is the conditional synchronization module, which uses high frame-rate visual features from a self-supervised audio-visual desynchronization detector. This gives frame-level temporal alignment that matches human perceptual precision within 25 milliseconds.
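The joint training idea described above can be pictured as one training stream fed by two kinds of data. The sketch below is only an illustration of that idea, not MMAudio's data pipeline: the `Sample` class, the `mixed_batches` function, the file names, and the choice to represent a missing modality as `None` are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Sample:
    """One training example; a modality the dataset lacks is None (assumption)."""
    audio: str
    video: Optional[str]
    text: Optional[str]


def mixed_batches(audio_visual: List[Sample], audio_text: List[Sample],
                  batch_size: int, seed: int = 0) -> Iterator[List[Sample]]:
    """Shuffle both datasets together and yield batches from the merged pool."""
    rng = random.Random(seed)
    pool = list(audio_visual) + list(audio_text)
    rng.shuffle(pool)
    for start in range(0, len(pool), batch_size):
        yield pool[start:start + batch_size]


# Toy stand-ins for an audio-visual and an audio-text dataset.
av = [Sample(audio=f"av_{i}.wav", video=f"av_{i}.mp4", text=None) for i in range(3)]
at = [Sample(audio=f"at_{i}.wav", video=None, text=f"caption {i}") for i in range(3)]

batches = list(mixed_batches(av, at, batch_size=4))
for batch in batches:
    print([(s.video is not None, s.text is not None) for s in batch])
```

Because every sample shares one format, a single model can see video-conditioned and text-conditioned examples in the same batch.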

Key Capabilities

Video-to-Audio Synthesis: Generates contextually appropriate audio from silent video content, matching actions, environments, and visual elements with synchronized sound effects and ambient audio.
Text-to-Audio Generation: Produces audio from text descriptions alone, achieving competitive performance compared to dedicated text-to-audio models despite being trained for multimodal tasks.
Multimodal Input: Accepts both video and text inputs simultaneously, producing more accurate and contextually relevant audio than single-modality approaches.
Efficient Architecture: The smallest variant uses only 157M parameters with low inference time (1.23s for an 8-second clip), making it practical for deployment without requiring extensive computational resources.
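The timing quoted above implies a speed relative to real time, which a quick calculation makes concrete. The helper name `real_time_factor` is illustrative; the two figures are the ones stated in the text.

```python
def real_time_factor(clip_seconds: float, inference_seconds: float) -> float:
    """Seconds of audio produced per second of compute."""
    if inference_seconds <= 0:
        raise ValueError("inference_seconds must be positive")
    return clip_seconds / inference_seconds


# Figures from the text: an 8-second clip generated in 1.23 seconds.
rtf = real_time_factor(8.0, 1.23)
print(f"about {rtf:.1f}x faster than real time")  # about 6.5x faster than real time
```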

Technical Approach

The model is trained with a flow matching objective and improves on prior work across four metrics: 10% lower Fréchet Distance (audio quality), 15% higher Inception Score, 4% higher ImageBind score (semantic alignment), and 14% better synchronization score (temporal alignment). MMAudio builds on several existing components: BigVGAN for neural vocoding, Synchformer for audio-visual synchronization, and a VAE architecture inspired by Make-An-Audio 2 and EDM2.
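A flow matching objective can be illustrated with a generic sketch. The NumPy example below uses a straight-line path between Gaussian noise and data, a common conditional flow matching formulation; that choice is an assumption made for illustration and the code is not taken from MMAudio. The names `flow_matching_loss` and `velocity_fn`, and the toy latent shape, are hypothetical.

```python
import numpy as np


def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between a predicted and a target velocity.

    x1 is a batch of data latents, cond is any conditioning passed through
    to velocity_fn, and rng is a numpy.random.Generator.
    """
    batch = x1.shape[0]
    x0 = rng.standard_normal(x1.shape)              # noise endpoint of the path
    # One time value per sample in [0, 1), shaped to broadcast over x1.
    t = rng.uniform(size=(batch,) + (1,) * (x1.ndim - 1))
    xt = (1.0 - t) * x0 + t * x1                    # point on the straight path
    target = x1 - x0                                # velocity of that path
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))


# Toy usage: a "network" that always predicts zero velocity.
rng = np.random.default_rng(0)
latents = rng.standard_normal((4, 16, 8))           # hypothetical latent shape
loss = flow_matching_loss(lambda xt, t, c: np.zeros_like(xt), latents, None, rng)
print(f"loss with a zero-velocity predictor: {loss:.3f}")
```

In a real system `velocity_fn` would be a neural network conditioned on video and text features, and sampling would integrate the learned velocity from noise to data.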

Availability

MMAudio is released as open-source software: the code is under the MIT license and the pretrained model checkpoints are under CC-BY-NC 4.0. The code is available on GitHub and includes a command-line interface and a Gradio demo. The model runs on modern GPUs with approximately 6 GB of VRAM in 16-bit mode. It can also be accessed through Replicate for cloud inference and integrated into ComfyUI workflows.

Limitations

The model may occasionally generate unintelligible speech-like sounds or background music of limited quality. It struggles with unfamiliar concepts, and its output quality depends on the training datasets (AudioSet, VGGSound, AudioCaps, WavCaps), which carry their own licensing terms. The standard output duration is 8 seconds, and significant deviations from this duration may reduce quality.
