DeepMind unveils V2A, an AI that creates soundtracks for videos. It uses video and audio data to generate music, effects, and dialogue. While not perfect, it raises concerns about the impact on creative jobs.
DeepMind, Google's AI research lab, says it's developing AI tech to generate soundtracks for videos.
In a post on its official blog, DeepMind says that it sees the tech, V2A (short for "video-to-audio"), as an essential piece of the AI-generated media puzzle. While plenty of organizations, including DeepMind, have developed video-generating AI models, these models can't create sound effects to sync with the videos that they generate.
"Video generation models are advancing at an incredible pace, but many current systems can only generate silent output," DeepMind writes. "V2A technology [could] become a promising approach for bringing generated movies to life."
DeepMind's V2A tech takes the description of a soundtrack (e.g., "jellyfish pulsating underwater, marine life, ocean") paired with a video to create music, sound effects, and even dialogue that matches the characters and tone of the video, watermarked by DeepMind's deepfakes-combating SynthID technology. The AI powering V2A, a diffusion model, was trained on a combination of sounds and dialogue transcripts as well as video clips, DeepMind says.
"By training on video, audio, and the additional annotations, our technology learns to associate specific audio events with various visual scenes while responding to the information provided in the annotations or transcripts," according to DeepMind.
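DeepMind hasn't published code, an API, or model details beyond "diffusion model," so the following is only a toy sketch of the general idea it describes: a denoiser trained to predict the noise added to an audio representation, conditioned on video features and annotation embeddings. Every class name, dimension, and the simplified noising schedule here is a hypothetical illustration, not DeepMind's architecture.

```python
# Purely illustrative sketch of a video- and text-conditioned audio
# diffusion training step. All names and shapes are hypothetical.
import torch
import torch.nn as nn

class ToyV2ADenoiser(nn.Module):
    """Predicts the noise added to an audio latent, conditioned on
    video features and an annotation (text) embedding."""
    def __init__(self, audio_dim=128, video_dim=256, text_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + video_dim + text_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, audio_dim),
        )

    def forward(self, noisy_audio, video_feats, text_emb, t):
        # Concatenate the noisy audio latent with all conditioning signals.
        x = torch.cat([noisy_audio, video_feats, text_emb, t], dim=-1)
        return self.net(x)

def training_step(model, audio_latent, video_feats, text_emb):
    """One denoising step: corrupt the clean audio latent with noise,
    then train the model to predict that noise given the context."""
    t = torch.rand(audio_latent.size(0), 1)       # random diffusion time
    noise = torch.randn_like(audio_latent)
    noisy = (1 - t) * audio_latent + t * noise    # simple linear noising
    pred = model(noisy, video_feats, text_emb, t)
    return nn.functional.mse_loss(pred, noise)

model = ToyV2ADenoiser()
loss = training_step(model,
                     torch.randn(8, 128),   # fake audio latents
                     torch.randn(8, 256),   # fake per-clip video features
                     torch.randn(8, 64))    # fake annotation embeddings
```

The point of conditioning on both video features and annotations is the association DeepMind describes: at generation time, the denoiser has learned which audio events co-occur with which visual scenes, and the text input can steer that choice.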
Mum's the word on whether any of the training data was copyrighted, and whether the data's creators were informed of DeepMind's work. We've contacted DeepMind for clarification and will update this post if we hear back.
AI-powered sound-generating tools aren't novel. Startup Stability AI released one just last week, and ElevenLabs launched one in May. Nor are models to create video sound effects. A Microsoft project can generate talking and singing videos from a still image, and platforms like Pika and GenreX have trained models to take a video and make a best guess at what music or effects are appropriate in a given scene.
However, DeepMind claims that its V2A tech is unique in that it can understand the raw pixels of a video and sync generated sounds with the video automatically, with or without a text description.
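DeepMind hasn't exposed any interface to V2A, but the behavior it describes (synced audio from raw video, with the text prompt optional) suggests a call shape along these lines. The function name and signature below are invented purely for illustration:

```python
# Hypothetical interface only: DeepMind has not released V2A publicly.
from typing import Optional

def generate_soundtrack(video_path: str, prompt: Optional[str] = None) -> bytes:
    """Return audio synced to the video at video_path. The prompt is
    optional, since V2A reportedly works from raw pixels alone."""
    return b""  # placeholder; a real system would return generated audio

# Guided by a soundtrack description:
audio = generate_soundtrack(
    "jellyfish.mp4",
    prompt="jellyfish pulsating underwater, marine life, ocean",
)

# Or with no description, letting the model infer sound from the pixels:
audio = generate_soundtrack("jellyfish.mp4")
```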
V2A isn't perfect, and DeepMind acknowledges this. Because the underlying model wasn't trained on many videos with artifacts or distortions, it doesn't create particularly high-quality audio for these. In general, the generated audio isn't super convincing; my colleague Natasha Lomas described it as "a smorgasbord of stereotypical sounds," and I can't say I disagree.
For those reasons and to prevent misuse, DeepMind says it won't release the tech to the public anytime soon.
"To make sure our V2A technology can positively impact the creative community, we're gathering diverse perspectives and insights from leading creators and filmmakers and using this valuable feedback to inform our ongoing research and development," DeepMind writes. "Before we consider opening access to it to the wider public, our V2A technology will undergo rigorous safety assessments and testing."
DeepMind pitches its V2A technology as an especially useful tool for archivists and those working with historical footage. But generative AI along these lines also threatens to upend the film and TV industry. It'll take some seriously strong labor protections to ensure that generative media tools don't eliminate jobs or, as the case may be, entire professions.


