This post was written by Claude 4.6 as a summary of Jack Saunders’ paper.


Visual dubbing — regenerating an actor’s lip movements in a video so they match audio in a new language — has huge commercial value. Watch any dubbed film and the mismatch between lip motion and audio is immediately distracting. Recent deep learning methods have made progress, but they come with a catch: they either need large amounts of person-specific data, or they sacrifice quality by relying on only a single reference frame.

DEAD (Data-Efficient Audiovisual Dubbing) breaks this trade-off.

The key insight: train a prior first.

Rather than trying to learn a person’s face from scratch with limited data, DEAD first trains a large multi-person prior network across many identities. This prior captures the general structure of how people speak and move. When adapting to a new person, the model only needs to learn the residual — what makes this specific person different — which requires far less data.
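The prior-plus-residual idea can be illustrated with a toy sketch. This is not the DEAD architecture — all names and shapes below are illustrative assumptions — but it shows why freezing a shared prior and fitting only a small per-person residual needs so little data: the residual is close to zero, so a handful of frames suffices.

```python
import numpy as np

# Toy illustration (NOT the DEAD model): a frozen "prior" maps audio
# features to lip parameters; per-person adaptation learns only a
# small residual on top of it, from very few training frames.
rng = np.random.default_rng(0)

# Frozen multi-person prior: a fixed linear map stands in for a
# large network trained across many identities.
W_prior = rng.normal(size=(4, 8))

# The new person's true mapping differs from the prior by a small offset.
W_true = W_prior + 0.1 * rng.normal(size=(4, 8))

# Few-shot adaptation data: just 5 (audio feature, lip parameter) pairs.
X = rng.normal(size=(8, 5))
Y = W_true @ X

# Learn only the residual by gradient descent; the prior stays frozen.
W_res = np.zeros((4, 8))
lr = 0.1
for _ in range(2000):
    pred = (W_prior + W_res) @ X
    grad = (pred - Y) @ X.T / X.shape[1]
    W_res -= lr * grad

err = np.linalg.norm((W_prior + W_res) @ X - Y)
```

Because the residual is small relative to the prior, the optimisation starts near the solution — the same reason a well-trained prior lets the real model adapt from seconds of footage rather than hours.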

The result: high-quality, identity-preserving visual dubbing from just a few seconds of training footage. This makes it practical for dubbing background actors, archival footage, or anyone without a full production setup.

DEAD demonstrates state-of-the-art visual quality and recognisability across two user studies, and outperforms baselines under real-world limited-data conditions.

DEAD was accepted at BMVC 2025.

Links: arXiv · Video