- By Vikas Yadav
- Fri, 19 Apr 2024 04:34 PM (IST)
- Source: JND
Microsoft VASA-1: Windows maker Microsoft, in a research blog post, detailed an AI model called VASA-1 (Visual Affective Skills), which can generate lifelike talking faces from a single image and an audio clip. Beyond precise lip movements synced to the audio, it captures a wide spectrum of facial cues and head movements to convey liveliness and authenticity. It supports the generation of videos up to one minute long at 512x512 resolution and up to 45 frames per second.
While video generation has an initial latency of 170ms, the model opens up a range of possibilities for emulating conversational behaviour through lifelike AI avatars. Microsoft shared a range of outputs generated from virtual images created with StyleGAN2 or DALL·E-3, and said the release is a demonstration for research purposes only.
"Our method is capable of not only producing precise lip-audio synchronisation, but also generating a large spectrum of expressive facial nuances and natural head motions," Microsoft said. The model takes an audio clip as input to generate talking-face videos. The diffusion model can also accept optional control signals while generating output, including "eye gaze direction, head distance and emotion offsets". Here are a few examples of Microsoft's outputs circulating on X (formerly called Twitter).
1.
3. Realism and liveliness - example 2 pic.twitter.com/7nVrTtDUmM
— Min Choi (@minchoi) April 18, 2024
2.
Product demo for Microsoft's VASA-1, which uses AI to generate realistic (ish) video of someone talking from a single photo and audio clip.
Still a bit uncanny.
I wonder what it could do based on an entire curated dataset of photos and videos.. pic.twitter.com/srrE0l8DUx
— Nick Tsergas (@nicktsergas) April 19, 2024
3.
Microsoft just dropped VASA-1.
This AI can make single image sing and talk from audio reference expressively. Similar to EMO from Alibaba
10 wild examples:
1. Mona Lisa rapping Paparazzi pic.twitter.com/LSGF3mMVnD
— Min Choi (@minchoi) April 18, 2024
4.
2. Realism and liveliness - example 1 pic.twitter.com/Kz0Bm2NRNy
— Min Choi (@minchoi) April 18, 2024
5.
9. Out-of-distribution generalization - singing audios pic.twitter.com/D5HhBpirWh
— Min Choi (@minchoi) April 18, 2024
Microsoft also shared that the model can handle artistic photos, animating them to sing or to speak in non-English languages. It can disentangle 3D head pose from facial dynamics, enabling attribute control and editing of the output. Here is an example shared by Microsoft, evaluated on a desktop with an NVIDIA RTX 4090 GPU:
(Video Source: Microsoft Research)
Microsoft highlighted potential use cases in communication, education, healthcare and other domains. Among the limitations, the company noted that the model can only generate output up to the torso. Another notable challenge is that it does not handle non-rigid elements such as hair and clothing.
To prevent the creation of content that could mislead or deceive viewers, the company said it will adopt advanced forgery detection measures. "We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations," Microsoft said.