DEAD: Data-Efficient Audiovisual Dubbing using Neural Rendering Priors
This post was written by Claude 4.6 as a summary of Jack Saunders’ paper.
Visual dubbing — replacing the lip movements in a video to match a new language — has huge commercial value. Watch any dubbed film and the mismatch between lip motion and audio is immediately distracting. Recent deep learning methods have made progress, but they come with a catch: they either need large amounts of person-specific data, or they sacrifice quality by relying on only a single reference frame.
DEAD (Data-Efficient Audiovisual Dubbing) breaks this trade-off.
The key insight: train a prior first.
Rather than trying to learn a person’s face from scratch with limited data, DEAD first trains a large multi-person prior network across many identities. This prior captures the general structure of how people speak and move. When adapting to a new person, the model only needs to learn the residual — what makes this specific person different — which requires far less data.
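The prior-then-residual idea can be sketched with a toy linear model. This is not the paper's architecture, just an illustrative analogue with made-up names: a shared "prior" map is fit on pooled multi-identity data, then frozen, and a new person is adapted by fitting only a small residual from a handful of samples.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4  # toy feature sizes, purely illustrative

# 1) "Prior": fit one shared linear map on pooled multi-identity data.
X_multi = rng.normal(size=(500, d_in))
W_shared = rng.normal(size=(d_in, d_out))
Y_multi = X_multi @ W_shared + 0.01 * rng.normal(size=(500, d_out))
W_prior, *_ = np.linalg.lstsq(X_multi, Y_multi, rcond=None)

# 2) A new person = shared behaviour plus a person-specific offset.
W_delta = 0.3 * rng.normal(size=(d_in, d_out))
X_few = rng.normal(size=(10, d_in))  # stand-in for "a few seconds" of data
Y_few = X_few @ (W_shared + W_delta)

# 3) Adapt: freeze the prior, fit only the residual from the few samples.
residual, *_ = np.linalg.lstsq(X_few, Y_few - X_few @ W_prior, rcond=None)

# Because the prior already explains the shared structure, the few-shot
# fit only has to capture the person-specific residual.
X_test = rng.normal(size=(100, d_in))
Y_test = X_test @ (W_shared + W_delta)
err_prior_only = np.mean((X_test @ W_prior - Y_test) ** 2)
err_adapted = np.mean((X_test @ (W_prior + residual) - Y_test) ** 2)
print(err_adapted < err_prior_only)
```

The same logic motivates DEAD's design: the heavy lifting (generic speech-to-face structure) is amortised across many identities, so per-person adaptation is a small, data-cheap problem.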
The result: high-quality, identity-preserving visual dubbing from just a few seconds of training footage. This makes it practical for dubbing background actors, archival footage, or anyone without a full production setup.
DEAD achieves state-of-the-art visual quality and identity recognisability in two user studies, and outperforms baselines under realistic limited-data conditions.
DEAD was accepted at BMVC 2025.