The system takes a number of images of a person – it could be just one, though more give better results – and runs them through an off-the-shelf "face landmark tracker" to work out where the eyes, eyebrows, nose, lips and jawline are. It does the same for a separate "driving" video, going frame by frame to track the motion of these face landmarks.
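To make the frame-by-frame tracking idea concrete, here is a minimal Python sketch. Everything named in it is an assumption for illustration: `landmark_motion` and `toy_tracker` are hypothetical, and the toy tracker simply stands in for whatever off-the-shelf tracker the researchers actually used. It records how far each landmark has moved relative to the first frame of the driving video.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
Landmarks = Dict[str, List[Point]]  # e.g. {"lips": [(x, y), ...], "eyes": [...]}


def landmark_motion(frames: list, tracker: Callable[[object], Landmarks]) -> List[Landmarks]:
    """Track every frame, then return each frame's landmark offsets
    relative to the first frame."""
    tracked = [tracker(frame) for frame in frames]
    base = tracked[0]
    motion = []
    for landmarks in tracked:
        offsets = {
            part: [(x - bx, y - by) for (x, y), (bx, by) in zip(points, base[part])]
            for part, points in landmarks.items()
        }
        motion.append(offsets)
    return motion


def toy_tracker(frame: float) -> Landmarks:
    """Hypothetical stand-in for a real tracker: a 'frame' here is just a
    number saying how far the lips have moved down the image."""
    return {"lips": [(10.0, 20.0 + frame), (14.0, 20.0 + frame)]}


# Three "frames" of a driving video in which the lips move steadily downward.
motion = landmark_motion([0.0, 1.5, 3.0], toy_tracker)
```

A real tracker would return dozens of points per face, but the shape of the data is the same: one set of named landmark positions per frame.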


There's a separate meta-learning stage, in which different AI networks are trained to do different jobs, using an enormous video dataset of talking heads. An Embedder network takes source frames and their landmark tracking data and turns them into vectors, while a Generator network learns to take those vectors along with landmark data, and generate short videos in which the still faces are animated to move according to the landmark movement.
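As a rough illustration of how one image or several can feed the same pipeline, the Python sketch below averages per-frame vectors into a single vector for the person. This is a toy under stated assumptions, not the Samsung team's code: `embed_frame` and `embed_person` are hypothetical names, the "embedding" is just two simple averages where a trained network would sit, and averaging across frames is one plausible way to combine them.

```python
from typing import List, Sequence

Vector = List[float]


def embed_frame(pixels: Sequence[float], landmarks: Sequence[float]) -> Vector:
    """Hypothetical per-frame embedding. A trained Embedder network would go
    here; this toy just returns the mean pixel value and mean landmark value."""
    return [sum(pixels) / len(pixels), sum(landmarks) / len(landmarks)]


def embed_person(frames: Sequence[Sequence[float]],
                 landmark_sets: Sequence[Sequence[float]]) -> Vector:
    """Average the per-frame vectors into one vector describing the person.
    Works the same whether there is one source frame or many."""
    vectors = [embed_frame(p, l) for p, l in zip(frames, landmark_sets)]
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


# One source photo ("one-shot") versus two source photos.
one_shot = embed_person([[0.0, 1.0]], [[2.0, 4.0]])
two_shot = embed_person([[0.0, 1.0], [1.0, 1.0]], [[2.0, 4.0], [4.0, 6.0]])
```

With more source frames, quirks of any single photo have less influence on the averaged vector, which fits the observation that extra images give better results.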


The third "Discriminator" network sets up the adversarial relationship – it learns to look at videos of moving faces, and tell which ones are real videos, and which ones have been faked up by the Generator network. So you've got two networks working against each other – one trying to fool the other, the other trying to spot fakes.


These networks start out really bad at their jobs, but as they perform their jobs millions of times, they begin to improve, and the competition between the two networks is what drives both to continue getting better. The Discriminator network isn't looking for the same things a human fake-spotter might be looking for, but it doesn't matter – whatever it's looking for, it keeps getting better at discriminating, so the Generator network has to keep getting better to keep fooling it.
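The tug-of-war can be written down as two scores that pull in opposite directions. The Python sketch below uses the textbook cross-entropy form of the adversarial objective; that choice is an assumption made for illustration, and the research paper may well use a different formulation. The function names are hypothetical. Each score is the Discriminator's estimated probability, strictly between 0 and 1, that a clip is real.

```python
import math
from typing import Sequence


def discriminator_loss(real_scores: Sequence[float], fake_scores: Sequence[float]) -> float:
    """Low when real clips score near 1 and generated clips score near 0."""
    real_term = -sum(math.log(s) for s in real_scores) / len(real_scores)
    fake_term = -sum(math.log(1.0 - s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores: Sequence[float]) -> float:
    """Low when the Discriminator scores generated clips near 1,
    meaning it has been fooled."""
    return -sum(math.log(s) for s in fake_scores) / len(fake_scores)


# A sharp Discriminator: real clips score 0.9, fakes score 0.1.
sharp_d = discriminator_loss([0.9], [0.1])
# A fooled Discriminator: fakes score 0.9 as well.
fooled_d = discriminator_loss([0.9], [0.9])

# The Generator's loss falls as its fakes score higher.
caught_g = generator_loss([0.1])
convincing_g = generator_loss([0.9])
```

Training alternates between nudging the Discriminator to lower its loss and nudging the Generator to lower its own, so any improvement on one side raises the bar for the other.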


This is another glimpse at the very exciting potential of Generative Adversarial Networks, which are popping up all over the AI world. But to truly appreciate it, you need to watch the video below. Skip to 4:16 if you want to get straight to seeing how the model performs with single-shot stills of Marilyn Monroe, Salvador Dali, Rasputin and Einstein, and then on to paintings.


Driven by three different driver videos, each face displays three very different personalities – it's a nod to just how much an actor can change their perceived personality by learning to use their face and body muscles in different ways. And seeing the Mona Lisa come to life might bring a smile to your lips – until you consider what developments like this mean for deepfakes that are ever more realistic and ever easier to produce.


The Samsung AI team's research paper is available at arXiv.