The Mona Lisa of Deepfakes Is Here: Microsoft’s VASA-1 Can Create Your Virtual Double With a Single Image and Audio Sample

Rachi Bhilwara

Updated on April 22, 2024

Earlier this week, Microsoft Research Asia unveiled VASA-1, an AI model that generates strikingly lifelike videos of a person talking or singing from just a single photo and an audio file.

This could reshape the future of content creation: anyone with a photo and an audio file could create a video of a person, without any specialized tools, and make that person appear to say whatever they want.

The VASA framework (also known as Visual Affective Skills Animator) uses machine learning to analyze a static image together with the speech patterns in an audio file, then generates realistic video with accurate facial expressions, head movements, and lip-syncing.

Unlike other AI models on the market, VASA does not clone or simulate a person’s voice. It relies solely on the audio sample it is given, which could be recorded specifically for a single purpose.

Microsoft claims that the model comfortably outperforms previous speech animation methods in terms of speech, expressions, and realism. To our eyes, the claim seems to hold up: the videos generated by this model look far better than those of any of its predecessors.