AI-based video generation has been on the rise in recent months. Amid ongoing debates over deepfakes and powerful tools such as OpenAI's Sora, Google researchers have announced VLOGGER, a "text and audio-driven" AI model that can generate videos from just one image of a person. Built on generative diffusion models, the system consists of two components.

The first is a "stochastic human-to-3D-motion diffusion model"; the second is an architecture that augments text-to-image models with spatial and temporal controls. Together, these machine learning models generate realistic, high-quality footage of people talking. The output is "controllable", can vary in length, and delivers a realistic representation of both the face and the body.
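
Google has not released code for VLOGGER, so the snippet below is only a rough sketch of how such a two-stage pipeline could be wired together. Every name and number in it (audio_to_motion, motion_to_video, the 128-dimensional motion vectors, the assumed frame rate and sample rate) is a hypothetical stand-in, not the paper's actual implementation.

```python
import numpy as np

def audio_to_motion(audio: np.ndarray, num_frames: int) -> np.ndarray:
    """Hypothetical stage 1: a stochastic diffusion model mapping input audio
    to per-frame 3D face and body motion parameters."""
    rng = np.random.default_rng(seed=0)
    # Placeholder: a real model would iteratively denoise motion conditioned on the audio.
    return rng.normal(size=(num_frames, 128))  # 128 motion dimensions is an assumption

def motion_to_video(reference_image: np.ndarray, motion: np.ndarray) -> np.ndarray:
    """Hypothetical stage 2: a temporally-aware image diffusion model rendering
    each frame of the person in reference_image following the predicted motion."""
    num_frames = motion.shape[0]
    # Placeholder: a real model would synthesise new frames; here we just repeat the input.
    return np.repeat(reference_image[None, ...], num_frames, axis=0)

# End to end: one photo plus one audio clip in, a sequence of video frames out.
photo = np.zeros((256, 256, 3), dtype=np.float32)  # the single reference image
audio = np.zeros(16000 * 5, dtype=np.float32)      # roughly 5 seconds of 16 kHz audio
motion = audio_to_motion(audio, num_frames=125)    # 25 fps x 5 s (assumed frame rate)
video_frames = motion_to_video(photo, motion)      # shape: (125, 256, 256, 3)
```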

According to the researchers, the method does not require per-person training to generate a dynamic video of an individual. It "does not rely on face detection and cropping, generates the complete image (not just the face or the lips), and considers a broad spectrum of scenarios", such as a visible torso and diverse subject identities.

(Video Source: enriccorona.github.io/vlogger)

In simple words, the model processes the photo and the audio clip and then produces a video in which the person in the photo appears to speak the audio, complete with matching facial expressions and hand gestures. VLOGGER was trained on the MENTOR dataset, which spans more than 2,200 hours of video and 800,000 identities. This diversity helps VLOGGER handle subjects of varying age, ethnicity, clothing and more.

It can also be used to edit existing videos, such as changing a subject's expressions or adding frames, according to the research team led by Enric Corona. Other use cases include changing the spoken language in a video while keeping the lip sync and facial expressions realistic, as sketched below.
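
As above, VLOGGER's interface is not public, so the following is only an illustrative sketch of what the dubbing workflow could look like; vlogger_generate and dub_video are hypothetical placeholders rather than real API calls.

```python
import numpy as np

def vlogger_generate(reference_image: np.ndarray, audio: np.ndarray,
                     num_frames: int) -> np.ndarray:
    """Hypothetical wrapper over the two-stage pipeline sketched earlier:
    audio-driven motion prediction, then motion-conditioned frame generation."""
    # Placeholder output; a real system would synthesise new frames here.
    return np.repeat(reference_image[None, ...], num_frames, axis=0)

def dub_video(original_video: np.ndarray, translated_audio: np.ndarray) -> np.ndarray:
    """Keep the speaker's identity from the original clip but re-drive the face
    and lips with audio spoken in a different language."""
    reference = original_video[0]          # one frame as the identity reference
    num_frames = original_video.shape[0]   # keep the original clip length
    return vlogger_generate(reference, translated_audio, num_frames)

# Example: dub a 4-second, 25 fps clip with translated speech.
clip = np.zeros((100, 256, 256, 3), dtype=np.float32)
translated_speech = np.zeros(16000 * 4, dtype=np.float32)  # 16 kHz audio assumed
dubbed = dub_video(clip, translated_speech)                 # lip sync follows the new audio
```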

"VLOGGER can be used as a stand-alone solution for presentations, education, narration, low-bandwidth online communication, and as an interface for text-only HCI [Human-Computer Interaction]," the authors noted. While the output shared on the website has a set of downsides that make it appear artificial in the current setting (such as robotic expressions and static background), it does raise several concerns about its misuse.