Microsoft researchers have released a paper on their newest project, VASA-1. You feed it a single portrait photo of a person and a speech audio clip, and it generates a talking-face video with near-perfect lip-sync.
The abstract of the paper includes a TL;DR that sums the whole thing up:
TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.
In theory, you could take someone else's face, pair it with a sample of their voice, and have them say anything so convincingly that it would pass as real. Today's deepfakes are already fairly convincing, but until now a company of Microsoft's size wasn't throwing its weight behind the technique.
For now, though, it's just a research paper; there's no actual tool open to the public. The researchers make it clear that all the demo clips use AI-generated people and that there is no product or API release planned.
The results are astonishing.
The generation is also controllable. For example, you can turn the head to a different angle, move the apparent camera closer or farther (a zoom-in or zoom-out effect), and swap the facial expression for another one.
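To make the idea of those controls concrete, here is a minimal sketch of what such conditioning signals might look like as parameters. This is purely illustrative: VASA-1 has no public API, and every name below (`ControlSignals`, `generate_talking_face`, the individual fields) is invented, not taken from the paper.

```python
# Hypothetical sketch only -- VASA-1 is not publicly released, so all names
# here are made up for illustration of the kinds of controls described above.
from dataclasses import dataclass


@dataclass
class ControlSignals:
    yaw_degrees: float = 0.0      # turn the head left or right
    pitch_degrees: float = 0.0    # tilt the head up or down
    camera_distance: float = 1.0  # < 1.0 looks zoomed in, > 1.0 zoomed out
    emotion: str = "neutral"      # e.g. "happy", "surprised"


def generate_talking_face(portrait_path: str, audio_path: str,
                          controls: ControlSignals) -> str:
    """Placeholder for a generator that would return a path to an output video."""
    raise NotImplementedError("VASA-1 has no public tool or API")


# Example request: head turned 20 degrees, slightly zoomed-in framing, happy face.
controls = ControlSignals(yaw_degrees=20.0, camera_distance=0.8, emotion="happy")
```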
The videos generated by the researchers' method are 512×512 px (a standard resolution for this kind of work given today's computing power) at 45 FPS, tested on a single Nvidia RTX 4090 GPU. A card that powerful can also drive a variety of other video and batch image-generation tools at similarly fast speeds.
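To put "real time" in concrete terms, here is a quick back-of-the-envelope check (plain Python, not anything from the paper): at 45 FPS, each 512×512 frame has to come out in roughly 22 milliseconds.

```python
# Back-of-the-envelope arithmetic: what 45 FPS implies per frame.
fps = 45
frame_budget_ms = 1000 / fps  # ~22.2 ms available to produce each 512x512 frame
print(f"Per-frame budget at {fps} FPS: {frame_budget_ms:.1f} ms")
```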
VASA stands for "Visual Affective Skills Animator." The "-1" appended to the model's name suggests the story is just beginning.
Further reading: