TalkLoRA

Personalised speech-driven facial animation using LoRA adaptors. Efficient transformer inference via chunking. BMVC 2024.


Authors: Jack Saunders, Vinay P. Namboodiri — University of Bath

Venue: British Machine Vision Conference (BMVC) 2024


Abstract

Transformer-based speech-driven facial animation models suffer from two key limitations: they are difficult to adapt to new, personalised speaking styles, and they are computationally inefficient on long sentences because self-attention scales quadratically with sequence length.

TalkLoRA addresses both. We apply Low-Rank Adaptation (LoRA) to learn small, subject-specific parameter adaptors that capture individual speaking styles with minimal training data. We additionally introduce a chunking strategy that processes audio in overlapping windows, reducing transformer complexity by an order of magnitude without sacrificing quality.
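The LoRA idea above can be sketched with plain NumPy. This is a generic illustration, not the paper's implementation: the function name, the rank `r`, and the scaling factor `alpha` are illustrative, and LoRA's usual convention of initialising `B` to zero (so the adaptor starts as a no-op) is assumed.

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Forward pass through a frozen linear layer with a LoRA adaptor.

    W: (d_out, d_in) frozen base weight shared across subjects.
    A: (r, d_in) and B: (d_out, r) are the trainable, subject-specific
    low-rank factors, with r << min(d_in, d_out).

    Output: x W^T + (alpha / r) * x A^T B^T, i.e. the base layer plus a
    scaled low-rank correction. With B initialised to zero, the adaptor
    initially leaves the base model unchanged.
    """
    r = A.shape[0]
    base = x @ W.T                      # frozen path
    delta = (x @ A.T) @ B.T             # low-rank, per-subject path
    return base + (alpha / r) * delta
```

The per-subject cost is only `r * (d_in + d_out)` parameters instead of `d_in * d_out` for a full fine-tuned weight, which is why a new speaking style can be learned from minimal data.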

TalkLoRA achieves state-of-the-art style adaptation and provides practical guidance on LoRA hyperparameter selection for speech-driven animation.
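The chunking strategy can be sketched as follows. The exact window size, overlap, and blending scheme used by TalkLoRA are not specified here; this is a minimal sketch assuming fixed-size windows whose outputs are averaged over the overlap regions. With window size `w`, attention costs O(w²) per chunk and there are roughly `n / (w - overlap)` chunks, so total cost grows linearly in sequence length `n` rather than quadratically.

```python
import numpy as np

def chunk_indices(seq_len, chunk_size, overlap):
    """Split [0, seq_len) into overlapping windows.

    Consecutive windows share `overlap` frames so that predictions
    stay smooth across chunk boundaries.
    """
    step = chunk_size - overlap
    starts = range(0, max(seq_len - overlap, 1), step)
    return [(s, min(s + chunk_size, seq_len)) for s in starts]

def stitch(chunk_outputs, spans, seq_len, dim):
    """Recombine per-chunk outputs, averaging over overlap regions."""
    out = np.zeros((seq_len, dim))
    count = np.zeros((seq_len, 1))
    for y, (s, e) in zip(chunk_outputs, spans):
        out[s:e] += y
        count[s:e] += 1
    return out / count
```

For example, a 10-frame sequence with `chunk_size=4` and `overlap=2` yields windows (0, 4), (2, 6), (4, 8), (6, 10); each chunk is run through the transformer independently, then stitched back together.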


Results

Qualitative comparison against baselines. Error heatmaps show TalkLoRA produces more accurate facial animations, particularly around the mouth region.

Citation

@inproceedings{Saunders2024TalkLoRA,
  author    = {Saunders, Jack and Namboodiri, Vinay P.},
  title     = {TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation},
  booktitle = {Proceedings of the British Machine Vision Conference (BMVC)},
  year      = {2024},
}