TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation
This post was written by Claude 4.6 as a summary of Jack Saunders’ paper.
Transformer-based models for speech-driven facial animation have become the state of the art — but they have two practical problems. First, they produce a kind of averaged, generic speaking style rather than capturing how a specific person talks. Second, their quadratic attention complexity makes them slow and memory-hungry for long sentences.
TalkLoRA solves both with a clean, lightweight approach.
Personalisation via Low-Rank Adaptation
LoRA (Low-Rank Adaptation) was originally developed for efficiently fine-tuning large language models. TalkLoRA applies the same idea to facial animation: for each subject, it learns a small set of low-rank parameter adaptors that capture that person's individual speaking style. These adaptors are compact, a tiny fraction of the full model size, and can be trained from minimal data. At inference, swapping adaptors swaps speaking styles.
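To make the low-rank idea concrete, here is a minimal sketch of a LoRA-style layer in plain Python. This is an illustration of the general LoRA technique, not TalkLoRA's actual code: the matrix shapes, the `alpha` scaling factor, and the helper names are all assumptions for the example.

```python
# Sketch of a LoRA-style adaptor (illustrative, not the paper's implementation).
# A frozen weight matrix W (d_out x d_in) is augmented with a low-rank update
# B @ A, where A is (r x d_in) and B is (d_out x r), with r << min(d_in, d_out).
# Only A and B are trained per subject; swapping (A, B) swaps speaking styles.

def matmul(X, Y):
    """Multiply two matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(W, A, B, x, alpha=1.0):
    """Compute (W + alpha * B @ A) @ x for a single input vector x."""
    delta = matmul(B, A)  # low-rank update, shape d_out x d_in
    Wx = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    Dx = [sum(d * xi for d, xi in zip(row, x)) for row in delta]
    return [w + alpha * d for w, d in zip(Wx, Dx)]

# Why the adaptors are compact: per-layer parameter counts for
# hypothetical sizes (d = 512, rank r = 4).
d_in, d_out, r = 512, 512, 4
full_finetune_params = d_in * d_out      # 262,144 parameters
lora_params = r * (d_in + d_out)         # 4,096 parameters (~1.6%)
```

The key property is that the frozen base model stays shared across all subjects; only the small `(A, B)` pair is stored and swapped per person.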
Chunking for efficiency
For long audio inputs, full self-attention over the entire sequence is prohibitively expensive. TalkLoRA introduces a chunking strategy that processes audio in overlapping windows, reducing transformer complexity by an order of magnitude without any measurable quality loss. The overlap between chunks ensures smooth, continuous output at the boundaries.
Together, these two contributions make personalised, efficient speech-driven animation practical at scale. TalkLoRA also provides empirical guidance on LoRA hyperparameter selection (rank and adaptor placement) for this specific domain.
TalkLoRA was accepted at BMVC 2024.