Sunday, May 26, 2024

Google’s Lumiere: New AI Model that Creates Realistic Videos


Lumiere is a cutting-edge text-to-video model that uses a new technique to create realistic videos from short text inputs.

Google has unveiled a new text-to-video model that generates lifelike videos from short text inputs.

Lumiere creates videos that showcase realistic motion and can also take images and other videos as inputs to improve results. Unveiled in a paper titled ‘Lumiere: A Space-Time Diffusion Model for Video Generation,’ Lumiere works differently from existing video generation models. It generates the entire temporal duration of the video at once, whereas existing models synthesize distant keyframes and then fill in the frames between them with temporal super-resolution.

Put simply, Lumiere focuses on the movement of objects in the image, whereas prior systems patch together a video from key frames where the movement already happened.
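To see why filling in motion between distant keyframes can lose information, here is a toy sketch in Python. It is not Lumiere's method or any model's actual pipeline: the one-dimensional oscillating "object", the function names, and the use of linear interpolation as a stand-in for temporal super-resolution are all assumptions made for illustration.

```python
import math

def true_position(frame, period=8.0):
    """Toy 1-D motion (assumed): an object oscillating once every `period` frames."""
    return math.sin(2.0 * math.pi * frame / period)

def keyframes_then_interpolate(num_frames, stride, period=8.0):
    """Keep every `stride`-th frame, then fill the gaps by linear interpolation.

    Linear interpolation is a simple stand-in for temporal super-resolution.
    The last keyframe is clamped to the final frame of the clip.
    """
    frames = []
    for t in range(num_frames):
        left = (t // stride) * stride
        right = min(left + stride, num_frames - 1)
        if right == left:
            frames.append(true_position(left, period))
            continue
        w = (t - left) / (right - left)
        frames.append((1 - w) * true_position(left, period)
                      + w * true_position(right, period))
    return frames

def all_frames_at_once(num_frames, period=8.0):
    """Produce every frame directly, with no keyframe step."""
    return [true_position(t, period) for t in range(num_frames)]

def max_error(num_frames=80, stride=8, period=8.0):
    """Largest gap between the interpolated clip and the true motion."""
    approx = keyframes_then_interpolate(num_frames, stride, period)
    exact = all_frames_at_once(num_frames, period)
    return max(abs(a - e) for a, e in zip(approx, exact))

if __name__ == "__main__":
    print("keyframe every 8 frames:", max_error(stride=8))
    print("keyframe every frame:   ", max_error(stride=1))
```

When the keyframes are spaced as far apart as the motion's period, they all catch the object at nearly the same position, so the interpolated clip misses most of the movement. With a keyframe at every frame, the error disappears.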

The model can generate videos of 80 frames. By comparison, Stability’s Stable Video Diffusion generates either 14 or 25 frames. The more frames, the smoother the motion of the video.

According to Google’s team, Lumiere outperforms rival video generation models from the likes of Pika, Meta, and Runway across various tests, including zero-shot trials.

The researchers also contend that Lumiere produces state-of-the-art generation outputs due to its alternative approach. They claim Lumiere’s outputs could be used in content creation tasks and video editing, including video inpainting and stylized generation (mimicking artistic styles it is shown) by using fine-tuned text-to-image model weights.

To achieve its results, Lumiere leverages a new architecture, Space-Time U-Net. This generates the entire temporal duration of the video simultaneously through a single pass in the model.

The Google team wrote that the novel approach improves consistency in outputs. “By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-framerate, low-resolution video by processing it in multiple space-time scales,” the paper reads.
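As a rough illustration of what spatial and temporal down- and up-sampling means for the size of the data, the Python sketch below tracks only the shape of a video as it passes through such a network. It is not the paper's architecture. The 80-frame count comes from the article, while the 128 by 128 resolution, the two levels, the factor of 2, and the function name are assumptions chosen for the example.

```python
def space_time_unet_shapes(frames=80, height=128, width=128, levels=2, factor=2):
    """Return the (frames, height, width) shape at each stage of a toy U-Net.

    Every level shrinks time AND space by `factor` on the way down,
    then grows them back by the same factor on the way up.
    All default values except `frames` are illustrative assumptions.
    """
    shape = (frames, height, width)
    stages = [shape]

    # Encoder: downsample in both time and space.
    for _ in range(levels):
        shape = tuple(dim // factor for dim in shape)
        stages.append(shape)

    # Decoder: upsample back to the full frame count and resolution.
    for _ in range(levels):
        shape = tuple(dim * factor for dim in shape)
        stages.append(shape)

    return stages

if __name__ == "__main__":
    for stage in space_time_unet_shapes():
        print(stage)
```

The point of the sketch is that the whole clip, not a handful of keyframes, enters and leaves the network, while most of the processing happens on a version that is compressed in both time and space.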

The goal of the Lumiere project was to create a system to enable novice users to create video content more easily.

However, the paper acknowledges the risk of potential misuse, specifically warning that models like Lumiere could be used to create fake or harmful content.

“We believe that it is crucial to develop and apply tools for detecting biases and malicious use cases to ensure a safe and fair use,” the paper reads.

Google has not made the model available to the public at the time of writing. However, you can explore various example generations on the showcase page on GitHub.

Google steps up video work

Lumiere follows VideoPoet, a Google-produced multimodal model that creates videos from text, video and image inputs. Unveiled in December 2023, VideoPoet uses a decoder-only transformer architecture, making it capable of creating content it has not been trained on.

Google has developed several video generation models, including Phenaki and Imagen Video, and plans to cover AI-generated videos with its detection tool SynthID.

Google’s video work complements its Gemini foundation model, specifically the Pro Vision multimodal endpoint, which handles images and video as input and generates text as output.
