MMAudio

MMAudio is an open-source AI model from Sony AI and UIUC that generates high-quality, synchronized audio from video and text inputs. The work was accepted at CVPR 2025 and reports state-of-the-art video-to-audio synthesis among public models.

Overview

MMAudio is an open-source AI model for video-to-audio synthesis developed by researchers at Sony AI and the University of Illinois Urbana-Champaign. Accepted at CVPR 2025, MMAudio generates high-quality, synchronized audio from video inputs, text descriptions, or both. The model achieves new state-of-the-art video-to-audio performance among public models; its smallest variant has only 157 million parameters and generates an 8-second audio clip in about 1.23 seconds.

How It Works

MMAudio uses a multimodal joint training approach: a single model is trained simultaneously on audio-visual and audio-text datasets. This differs from previous methods, which either trained only on audio-visual data from scratch or added control modules to pretrained text-to-audio models. By training across multiple data types, MMAudio learns a unified semantic space that improves both audio quality and alignment with visual content. A key innovation is the conditional synchronization module, which uses high frame-rate visual features from a self-supervised audio-visual desynchronization detector to align audio with video at the frame level. The stated target is temporal precision within 25 milliseconds, on the order of what human perception can detect.
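The joint training idea can be pictured as drawing one mixed batch from both kinds of dataset and mapping every record onto a shared schema. The following is a minimal sketch, not the authors' data pipeline: the toy records, the `MISSING` placeholder, and the function names `to_joint_sample` and `joint_batch` are all hypothetical, and how the real model represents an absent modality is not described in the text above.

```python
import random

# Hypothetical toy records; a real pipeline would hold tensors, not strings.
AUDIO_VISUAL = [{"audio": "a0", "video": "v0"}, {"audio": "a1", "video": "v1"}]
AUDIO_TEXT = [
    {"audio": "a2", "text": "rain on a roof"},
    {"audio": "a3", "text": "dog barking"},
]

# Placeholder for an absent modality (an assumption of this sketch).
MISSING = None


def to_joint_sample(record):
    """Map a record from either dataset onto one (audio, video, text) schema."""
    return {
        "audio": record["audio"],
        "video": record.get("video", MISSING),
        "text": record.get("text", MISSING),
    }


def joint_batch(batch_size, rng):
    """Draw one mixed batch from the pooled audio-visual and audio-text data."""
    pool = AUDIO_VISUAL + AUDIO_TEXT
    return [to_joint_sample(rng.choice(pool)) for _ in range(batch_size)]


batch = joint_batch(4, random.Random(0))
for sample in batch:
    print(sample)
```

Because every sample shares the same three fields, one network can consume both data sources in the same training step, which is the property the joint training description relies on.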

Key Capabilities

  • Video-to-Audio Synthesis: Generates contextually appropriate audio from silent video content, matching actions, environments, and visual elements with synchronized sound effects and ambient audio.
  • Text-to-Audio Generation: Produces audio from text descriptions alone, performing competitively with dedicated text-to-audio models even though it is trained for multimodal tasks.
  • Multimodal Input: Accepts both video and text inputs simultaneously, producing more accurate and contextually relevant audio than single-modality approaches.
  • Efficient Architecture: The smallest variant uses only 157M parameters with low inference time (1.23s for an 8-second clip), making it practical for deployment without requiring extensive computational resources.
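To put the quoted timing in perspective, the ratio of clip length to inference time gives a rough "faster than real time" factor. This is simple arithmetic on the two figures from the list above; the helper name `real_time_factor` is made up for illustration, and the actual speed will depend on hardware, which the text does not specify.

```python
def real_time_factor(clip_seconds, inference_seconds):
    """Seconds of audio produced per second of compute."""
    return clip_seconds / inference_seconds


# Figures quoted above: an 8-second clip generated in 1.23 seconds.
rtf = real_time_factor(8.0, 1.23)
print(f"{rtf:.2f}x faster than real time")  # prints "6.50x faster than real time"
```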

Technical Approach

The model is trained with a flow matching objective and reports the following improvements over prior work: 10% lower Fréchet Distance (audio quality), 15% higher Inception Score, 4% higher ImageBind score (semantic alignment), and a 14% better synchronization score (temporal alignment). MMAudio builds on several existing components, including BigVGAN for neural vocoding, Synchformer for audio-visual synchronization, and a VAE architecture inspired by Make-An-Audio 2 and EDM2.
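For readers unfamiliar with flow matching, the sketch below shows the general shape of such an objective on plain Python lists. It assumes a straight-line path between Gaussian noise `x0` and a data sample `x1`, which is one common formulation; the text above does not spell out the exact variant MMAudio uses, and all function names here are illustrative rather than taken from the MMAudio code.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight line from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, rng):
    """Mean squared error between predicted and target velocity at a random t."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # noise sample
    t = rng.random()                        # time in [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


# A stand-in "network" that always predicts zero velocity.
def zero_model(xt, t):
    return [0.0] * len(xt)


loss = flow_matching_loss(zero_model, [0.5, -1.0, 2.0], random.Random(0))
print(f"loss for the zero model: {loss:.4f}")
```

In a real system the stand-in `zero_model` would be a neural network that also receives the video and text conditions, and the loss would be minimized with gradient descent over many sampled pairs.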

Availability

MMAudio is released as open-source software: the code is under the MIT license, and the pretrained model checkpoints are under CC-BY-NC 4.0. The code is available on GitHub and supports both a command-line interface and a Gradio demo. The model runs on modern GPUs with approximately 6 GB of VRAM in 16-bit mode. It can also be accessed through Replicate for cloud inference and integrated into ComfyUI workflows.
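As a rough sanity check on memory, the weights of a 157M-parameter network occupy far less than 6 GB on their own. The calculation below is back-of-the-envelope arithmetic with a hypothetical helper name; the roughly 6 GB figure presumably also covers the supporting components, activations, and possibly a larger model variant, none of which the text breaks down.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


params = 157e6  # parameter count quoted for the smallest variant
print(f"16-bit weights: {weight_memory_gb(params, 2):.3f} GB")
print(f"32-bit weights: {weight_memory_gb(params, 4):.3f} GB")
```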

Limitations

The model may occasionally generate unintelligible, speech-like sounds or background music of limited quality. It struggles with unfamiliar concepts, and its output quality depends on the training datasets (AudioSet, VGGSound, AudioCaps, WavCaps), which carry their own licensing terms. The standard output duration is 8 seconds, and significant deviations from this duration may reduce quality.

Similar AI Tools

Poppy AI

Multiplayer AI workspace for analyzing videos, podcasts, PDFs, and voice notes to create viral content and brainstorm ideas collaboratively.

Muku AI

Muku AI is an AI influencer agency platform that transforms product URLs, scripts, and ideas into professional UGC-style video ads.

Clipchamp

Microsoft AI-powered online video editor for creating, editing, and sharing HD videos with no expertise required.

Syllabbles

All-in-one platform to create ebooks, flipbooks, audiobooks, podcasts, and designs from any source: AI, files, URLs, voice, or video.

Amical

Open-source AI dictation app that types 4x faster with voice. Works in any app with context-aware formatting, custom vocabulary, and 100+ language support.
