• Source:JND

Microsoft VASA-1: In a research blog post, Windows maker Microsoft detailed VASA-1, an AI model named for the "visual affective skills" it aims to reproduce, which can generate lifelike talking-face videos from a single image and an audio clip. Beyond lip movements that sync cleanly with the audio, it captures a wide spectrum of facial cues and head movements that add liveliness and authenticity. It can generate 512x512-resolution videos up to one minute long at up to 45 frames per second.

Although generation starts after an initial latency of 170 ms, the model opens up a range of possibilities for emulating conversational behaviour through lifelike AI avatars. The sample clips Microsoft shared are based on portraits generated with StyleGAN2 or DALL·E-3, and the company said they are only a demonstration for research purposes.


"Our method is capable of not only producing precise lip-audio synchronisation, but also generating a large spectrum of expressive facial nuances and natural head motions," Microsoft said. The model takes an audio clip as input to generate talking-face videos. Its diffusion model can also accept other control signals while generating output, including "eye gaze direction, head distance and emotion offsets". Several of the examples shared by Microsoft have also been circulating on X (formerly Twitter).

Microsoft said the model can also handle artistic photos, portraying them as singing songs or speaking in a non-English language. It disentangles 3D head pose from facial dynamics, which enables attribute control and editing of the output. Microsoft also shared a demo video of the model evaluated on a desktop PC with an NVIDIA RTX 4090 GPU.

(Video Source: Microsoft Research)

Microsoft highlighted potential use cases in communication, education, healthcare and other domains. On limitations, the company noted that the model currently processes the human figure only up to the torso. Another challenge it cited is the handling of elements such as hair and clothing.


To guard against content that may mislead or deceive viewers, the company said it is interested in applying the technique to advance forgery detection. "We have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations," Microsoft said.