- By Vikas Yadav
- Fri, 19 Apr 2024 04:34 PM (IST)
- Source: JND
Microsoft VASA-1: Windows maker Microsoft, in a research blog post, detailed an AI model called VASA-1 (Visual Affective Skills), which can generate lifelike talking faces from a single image and an audio clip. Beyond precise lip movements synced to the audio, it captures a wide spectrum of facial cues and head movements to convey liveliness and authenticity. It supports the generation of videos up to one minute long at 512x512 resolution and up to 45 frames per second.
While video generation has an initial latency of 170ms, the model opens up a range of possibilities for emulating conversational behaviour through lifelike AI avatars. Microsoft shared a range of outputs generated from virtual images created with StyleGAN2 or DALL·E-3, and said the release is a demonstration for research purposes only.
"Our method is capable of not only producing precise lip-audio synchronisation, but also generating a large spectrum of expressive facial nuances and natural head motions," Microsoft said. The model takes an audio clip as input to generate talking-face videos. The diffusion model can also accept optional control signals while generating output, including "eye gaze direction, head distance and emotion offsets". Here are a few examples of Microsoft's outputs circulating on X (formerly called Twitter).
1.
3. Realism and liveliness - example 2 pic.twitter.com/7nVrTtDUmM
— Min Choi (@minchoi) April 18, 2024
2.
Product demo for Microsoft's VASA-1, which uses AI to generate realistic (ish) video of someone talking from a single photo and audio clip.
Still a bit uncanny.
I wonder what it could do based on an entire curated dataset of photos and videos.. pic.twitter.com/srrE0l8DUx
— Nick Tsergas (@nicktsergas) April 19, 2024
3.
Microsoft just dropped VASA-1.
This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba
10 wild examples:
1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
— Min Choi (@minchoi) April 18, 2024
4.
2. Realism and liveliness - example 1 pic.twitter.com/Kz0Bm2NRNy
— Min Choi (@minchoi) April 18, 2024
5.
9. Out-of-distribution generalization - singing audios pic.twitter.com/D5HhBpirWh
— Min Choi (@minchoi) April 18, 2024
Microsoft also shared that the model can handle artistic photos, animating them to sing or to speak in non-English languages. It can disentangle 3D head pose from facial dynamics, enabling attribute control and editing of the output. Here is an example shared by Microsoft, evaluated on a desktop with an NVIDIA RTX 4090 GPU:
(Video Source: Microsoft Research)
Microsoft highlighted potential use cases in communication, education, healthcare and other domains. Among the limitations, the company noted that the model can only generate output up to the torso. Another notable challenge is that it does not handle non-rigid elements such as hair and clothing.
To prevent the creation of content that could mislead or deceive viewers, the company said it will adopt advanced forgery detection measures. "We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations," Microsoft said.