Meta's Make-A-Video research builds on recent progress in text-to-image (T2I) generation to enable text-to-video (T2V) generation. The system uses images paired with descriptions to learn what the world looks like and how it is often described, and it uses unlabeled videos to learn how the world moves. With this data, Make-A-Video lets you bring your imagination to life by generating whimsical, one-of-a-kind videos from just a few words or lines of text.

Make-A-Video has three advantages:

(1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch),

(2) it does not require paired text-video data, and

(3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today’s image generation models.

The authors design a simple yet effective way to build on T2I models with novel spatial-temporal modules.

1. They decompose the full temporal U-Net and attention tensors and approximate them in space and time.

2. They design a spatiotemporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models, which can enable various applications besides T2V.

In all aspects (spatial and temporal resolution, faithfulness to text, and quality), Make-A-Video sets a new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
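To make the first design point concrete, here is a minimal NumPy sketch of approximating attention "in space and time": instead of letting every pixel in every frame attend to every other, attention runs first within each frame and then across frames at each spatial position. This is an illustration under simplifying assumptions, not the authors' implementation. It uses a single head with no learned projections, and the function names (`self_attention`, `factorized_space_time_attention`, `attention_score_count`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # tokens: (..., n, d). Simplified single-head attention in which
    # queries, keys and values are all the tokens themselves
    # (an assumption made to keep the sketch short).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def factorized_space_time_attention(video):
    # video: (frames, height, width, channels)
    f, h, w, c = video.shape
    # Spatial step: each frame attends over its own h*w positions.
    x = self_attention(video.reshape(f, h * w, c))      # (f, h*w, c)
    # Temporal step: each spatial position attends over the f frames.
    x = self_attention(np.swapaxes(x, 0, 1))            # (h*w, f, c)
    return np.swapaxes(x, 0, 1).reshape(f, h, w, c)


def attention_score_count(f, h, w):
    # Number of pairwise attention scores: full joint space-time
    # attention versus the spatial-then-temporal factorization.
    full = (f * h * w) ** 2
    factorized = f * (h * w) ** 2 + (h * w) * f ** 2
    return full, factorized
```

For a 16-frame clip at 8x8 resolution, `attention_score_count(16, 8, 8)` gives 1,048,576 scores for full attention against 81,920 for the factorized form, which shows why the approximation becomes attractive as clips get longer and larger.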
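The second design point, the staged pipeline, can be pictured as a chain in which each stage changes only the frame count or the spatial size. The sketch below tracks clip shapes only and runs no model. The concrete numbers (a 16-frame 64x64 decoder output, 4 new frames between each pair, 4x then 3x upscaling) are illustrative assumptions rather than values stated in the text above, and `Clip`, `interpolate_frames`, `super_resolve`, and `run_pipeline` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clip:
    frames: int
    height: int
    width: int


def interpolate_frames(clip, new_between):
    # Insert `new_between` frames between each consecutive pair of frames.
    frames = (clip.frames - 1) * (new_between + 1) + 1
    return Clip(frames, clip.height, clip.width)


def super_resolve(clip, factor):
    # Upscale every frame spatially by an integer factor.
    return Clip(clip.frames, clip.height * factor, clip.width * factor)


def run_pipeline(decoded):
    # Returns the clip shape after each stage, in order.
    # All stage parameters below are assumed for illustration.
    stages = [("video decoder", decoded)]
    clip = interpolate_frames(decoded, new_between=4)
    stages.append(("frame interpolation", clip))
    clip = super_resolve(clip, factor=4)
    stages.append(("super-resolution 1", clip))
    clip = super_resolve(clip, factor=3)
    stages.append(("super-resolution 2", clip))
    return stages


if __name__ == "__main__":
    for name, clip in run_pipeline(Clip(frames=16, height=64, width=64)):
        print(f"{name}: {clip.frames} frames at {clip.height}x{clip.width}")
```

Keeping the stages separate is what allows them to be reused on their own, for example applying only the interpolation stage to an existing clip, which is one way to read the claim that the pipeline enables applications besides T2V.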