Meta Text to Video Generator

Meta's Make-A-Video research builds on recent progress in text-to-image (T2I) generation to enable text-to-video (T2V) generation. The system uses images with descriptions to learn what the world looks like and how it is often described, and unlabeled videos to learn how the world moves. With this data, Make-A-Video lets you bring your imagination to life by generating whimsical, one-of-a-kind videos from just a few words or lines of text.

Make-A-Video has three advantages:
(1) it accelerates training of the T2V model, since it does not need to learn visual and multimodal representations from scratch;
(2) it does not require paired text-video data; and
(3) the generated videos inherit the vastness (diversity in aesthetics, fantastical depictions, etc.) of today's image generation models.

The authors design a simple yet effective way to build on T2I models with novel spatiotemporal modules:

1. They decompose the full temporal U-Net and attention tensors and approximate them in space and time (a rough sketch of this factorization follows the list).
2. They design a spatiotemporal pipeline to generate high-resolution, high-frame-rate videos with a video decoder, an interpolation model, and two super-resolution models that can enable various applications besides T2V (a sketch of this cascade also follows).

In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state of the art in text-to-video generation, as determined by both qualitative and quantitative measures.
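
To make the factorization in step 1 concrete, here is a minimal, hypothetical PyTorch-style sketch of a "pseudo-3D" block: a pretrained spatial convolution from the image model is applied to every frame independently, and a newly added, identity-initialized temporal convolution is applied along the frame axis, so space and time are handled separately instead of with a full 3D operation. The class name, shapes, and initialization choices are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch of a factorized spatio-temporal ("pseudo-3D") block.
# Module names and initialization choices are assumptions for illustration only.
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """A spatial 2D conv (as in a pretrained T2I model) followed by a 1D temporal conv."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Spatial layer: in a Make-A-Video-style setup this would come from the
        # pretrained image model.
        self.spatial = nn.Conv2d(channels, channels, kernel_size, padding=kernel_size // 2)
        # Temporal layer: newly added; identity-initialized so the block initially
        # behaves exactly like the image model on each frame.
        self.temporal = nn.Conv1d(channels, channels, kernel_size, padding=kernel_size // 2)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        # Apply the spatial conv to every frame independently.
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)
        x = self.spatial(x)
        # Apply the temporal conv along the frame axis at every spatial location.
        x = x.reshape(b, f, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, f)
        x = self.temporal(x)
        x = x.reshape(b, h, w, c, f).permute(0, 3, 4, 1, 2)
        return x  # (batch, channels, frames, height, width)


if __name__ == "__main__":
    block = Pseudo3DConv(channels=8)
    clip = torch.randn(2, 8, 16, 32, 32)  # 2 clips, 16 frames, 32x32 pixels
    print(block(clip).shape)              # torch.Size([2, 8, 16, 32, 32])
```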
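
The cascade in step 2 can similarly be pictured as a chain of stages: a video decoder producing a short, low-resolution clip from the text embedding, a frame-interpolation network raising the frame rate, and two super-resolution networks raising the spatial resolution. The sketch below only shows the orchestration; the stage interfaces and attribute names are assumptions, and each stage in the real system is a separately trained network.

```python
# Hypothetical orchestration of the cascaded spatiotemporal pipeline.
# The four stage modules are placeholders for the paper's video decoder,
# frame-interpolation network, and two super-resolution networks.
from dataclasses import dataclass

import torch
import torch.nn as nn


@dataclass
class MakeAVideoPipeline:
    decoder: nn.Module       # text embedding -> low-res, low-frame-rate video
    interpolator: nn.Module  # increases the frame rate
    sr_low: nn.Module        # first spatial super-resolution stage
    sr_high: nn.Module       # second spatial super-resolution stage

    @torch.no_grad()
    def __call__(self, text_embedding: torch.Tensor) -> torch.Tensor:
        video = self.decoder(text_embedding)  # e.g. a few frames at low resolution
        video = self.interpolator(video)      # e.g. several times more frames
        video = self.sr_low(video)            # e.g. upsample to medium resolution
        video = self.sr_high(video)           # e.g. upsample to final resolution
        return video
```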

4 thoughts on “Meta Text to Video Generator”

  1. If you want to make any money from a video in the future, you'll need to include a coherent plot, dialogue, and story. Otherwise it'll be indistinguishable from free AI content.

    The current industry will need to completely rethink its approach.

  2. The near future for both art and video will be an interactive back-and-forth between a human creator and an AI generator. The human will toss out ideas, maybe point the AI at some 'element' images, then have the AI generate a range of representations. The human picks a couple and gives the AI instructions on how to adjust the representation, and so on.

    Art adequate to illustrate stories online will take a creator/AI maybe half a day to produce.

    In an attempt to deal with accusations of digital photo fakes, photographers will use ‘crypto-cams’ that use blockchains to create a continuous chain of custody from the moment of capturing a photo to the moment someone views it in association with a news story. You’ll still have to trust the photographer, but they’d have to put a lot of effort into making a convincing fake, putting their reputation at risk.

    Theatre-quality movies will be a while coming, but mocking up a new movie will take a couple of people a few weeks, so they can use that to sell a movie idea to a studio.

    The movie equivalent of Fan Fiction will take off – generating thousands of unofficial Harry Potter, Spiderman, and My Little Pony (etc.) movies – 80% of them unsuitable for children. Studios will attempt to suppress them, but underground Fan Movies will just keep getting bigger, better, and more widely appreciated and consumed. As the tech gets really good, non-derivative creations will make the whole issue moot, as the big studios can't get enough customers to pay for making movies the old, expensive way. Most revenues will come as donations or payments from YouTube-style distribution platforms.

  3. And the memes will never be the same.

    Jokes aside, this is a big step towards synthetic movies. Nevertheless, there's still a long road towards AI-generated movies that model object persistence and coherency.

    Persistence is basically the stuff we get from filming real objects and actors with cameras, or from virtual entities that are nevertheless persistent across shots.

    Coherency is respect for the semantic rules governing relationships between objects: the fact that people enter rooms through doors rather than appearing suddenly out of thin air or passing through walls, or that wheels turn at a speed synchronized with the ground they roll over, and a long list of other real-life relationships between objects.

    I imagine a top-down approach might work, with language-modeling AIs generating stories, others generating scenes, then individual actions and dialogue, and finally others rendering those into images and video, probably using predefined assets (actors and scenery) to force some self-consistency in look.

    • Temporal coherence would be the next big challenge for this type of technology, in my opinion. Then after that, overall verisimilitude as the final finishing touch.
