Joe Russo, co-director of Marvel's "Avengers" films, shares my growing belief that fully AI-generated movies and TV shows could arrive within our lifetimes. Recent developments, such as OpenAI's strikingly realistic text-to-speech engine, have brought that era closer. Meta's announcement of Emu Video, an evolved version of its image generation tool Emu, marks another step in the same direction.
Introduced today, Emu Video generates four-second animated clips from a caption, an image, or a photo paired with a description. Alongside it, Meta unveiled Emu Edit, an AI model that lets users modify generated clips with natural language instructions, such as asking for a slow-motion version of the same clip.
Video generation technology isn't new: Meta and Google have both dabbled in it before, and startups like Runway have built businesses around it. Even so, Emu Video's high-quality 512x512, 16-frames-per-second clips stand out. They are often hard to distinguish from real footage, at least in simpler, mostly static scenes or in clips rendered in artistic styles such as cubism or anime.
However, Emu Video isn't without its quirks. Odd physics and strange object behavior are common, and the model struggles with dynamic actions: a raccoon might hold a guitar without ever strumming it, or unicorns might sit beside a chessboard without moving a single piece.