Welcome.
This repository contains the official implementation of TempoSyncDiff, a framework for audio-driven talking-head generation designed to achieve low-latency inference while maintaining temporal coherence and identity consistency.
TempoSyncDiff explores whether diffusion models can be distilled into efficient few-step generators without substantially degrading visual fidelity or motion stability.
Title
TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation
The manuscript is currently submitted for peer review.
Until formal publication, the arXiv version serves as the primary public reference.
Audio-driven talking-head synthesis aims to generate photorealistic facial animations aligned with speech signals. While diffusion-based generative models have demonstrated strong synthesis quality, their computational cost often limits real-time or low-latency deployment.
TempoSyncDiff investigates a teacher-student distillation framework for latent diffusion models that reduces the number of denoising steps required during inference. The approach integrates identity anchoring, temporal regularization, and viseme-conditioned control to maintain identity consistency and reduce frame-to-frame instability.
The resulting distilled student model enables few-step latent diffusion inference, offering improved efficiency while preserving visual quality and temporal coherence.
Diffusion models are capable of impressive visual synthesis, although they are occasionally fond of taking their time. TempoSyncDiff explores whether teacher-student distillation can preserve much of the denoising capability of a stronger diffusion model while enabling fast few-step inference.
The framework incorporates teacher-student distillation, identity anchoring, temporal regularization, and viseme-conditioned control.

The method aims to balance three key objectives:
visual fidelity
temporal coherence and identity consistency
low-latency, few-step inference
Achieving all three simultaneously remains a respectable negotiation with both GPUs and the laws of optimization.
TempoSyncDiff follows a latent diffusion training and distillation pipeline:
Frame Compression
Video frames are encoded into a compact latent space using a lightweight Variational Autoencoder (VAE).
Teacher Diffusion Training
A latent diffusion teacher model learns to predict noise across the diffusion process.
Student Distillation
A smaller student denoiser is trained to approximate the teacher using fewer denoising steps (see the sketch after this list).
Few-Step Inference
The distilled student model performs efficient video frame generation conditioned on identity and speech-derived tokens.
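To make the distillation stage concrete, here is a minimal sketch of one training step, assuming PyTorch and hypothetical teacher/student callables operating on VAE latents. The simplified linear update rule and the step count are illustrative assumptions, not the repository's actual implementation:

import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_rollout(teacher, z_t, t, cond, n_steps=8):
    # Run the frozen teacher for several denoising steps to form a target latent.
    # The linear update below is a deliberately simplified stand-in for a real sampler.
    dt = t / n_steps
    for _ in range(n_steps):
        eps = teacher(z_t, t, cond)   # teacher's noise prediction
        z_t = z_t - dt * eps
        t = t - dt
    return z_t

def distill_step(student, teacher, z_t, t, cond, optimizer):
    target = teacher_rollout(teacher, z_t, t, cond)  # multi-step teacher target
    pred = student(z_t, t, cond)                     # single student step
    loss = F.mse_loss(pred, target)                  # match the teacher's trajectory
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()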
Additional training components include:
an identity anchoring loss to preserve the reference subject's appearance
temporal regularization to reduce frame-to-frame instability
viseme-conditioned control to align mouth motion with speech
This repository provides:
scripts for dataset manifest construction, VAE pretraining, teacher training, student distillation, inference, and evaluation
YAML configuration files for each pipeline stage
the temposyncdiff source package (data, losses, models, utils)

This repository does not include:
the LRS3 or HDTF datasets, which must be obtained separately from their official sources
pretrained weights; the checkpoints/ directory is populated by the training scripts
TempoSyncDiff/
├── README.md
├── LICENSE
├── .gitignore
├── pyproject.toml
├── requirements.txt
│
├── checkpoints/
│   └── README.md
│
├── configs/
│   ├── pretrain/
│   │   └── tinyvae_lrs3.yaml
│   ├── train/
│   │   ├── teacher_lrs3.yaml
│   │   └── student_distill_lrs3.yaml
│   ├── infer/
│   │   └── student_infer.yaml
│   └── eval/
│       ├── denoise_eval.yaml
│       └── latency_cpu.yaml
│
├── data/
│   ├── README.md
│   ├── lrs3/
│   ├── hdtf/
│   ├── manifests/
│   ├── visemes/
│   └── examples/
│
├── docs/
│   └── RELEASE_NOTES.md
│
├── outputs/
│   ├── logs/
│   ├── samples/
│   ├── metrics/
│   ├── plots/
│   └── tables/
│
├── scripts/
│   ├── build_lrs3_manifest.py
│   ├── pretrain_vae.py
│   ├── train_teacher.py
│   ├── distill_student.py
│   ├── infer_student.py
│   └── evaluate.py
│
└── src/
    └── temposyncdiff/
        ├── data/
        ├── losses/
        ├── models/
        └── utils/
Experiments primarily involve:
LRS3 (Lip Reading Sentences 3)
HDTF (High-Definition Talking Face)
Please obtain these datasets from their official distribution sources and comply with their licenses and usage terms.
data/lrs3/
└── <talk_id>/
    ├── 00001.mp4
    ├── 00001.txt
    ├── 00002.mp4
    └── 00002.txt
Optional viseme token files:
data/visemes/lrs3/
└── <talk_id>/
    ├── 00001.npy
    └── 00002.npy
If viseme tokens are unavailable, the pipeline automatically falls back to zero-token conditioning, so training and inference remain runnable; a minimal sketch of this fallback follows.
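The fallback could look like the following, assuming NumPy; the helper name load_viseme_tokens and the token_dim shape are hypothetical, not necessarily the repository's actual loader:

import os
import numpy as np

def load_viseme_tokens(path, num_frames, token_dim=1):
    # Use precomputed viseme tokens when the file exists.
    if os.path.isfile(path):
        return np.load(path)
    # Otherwise fall back to zero-token conditioning so the pipeline stays runnable.
    return np.zeros((num_frames, token_dim))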
Python 3.10+ is recommended.
Linux/macOS:

git clone https://github.com/mazumdarsoumya/TempoSyncDiff.git
cd TempoSyncDiff
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio
python -m pip install -r requirements.txt
python -m pip install -e .
Windows (PowerShell):

git clone https://github.com/mazumdarsoumya/TempoSyncDiff.git
cd TempoSyncDiff
py -3 -m venv .venv
.\.venv\Scripts\Activate.ps1
python -m pip install --upgrade pip
python -m pip install torch torchvision torchaudio
python -m pip install -r requirements.txt
python -m pip install -e .
python scripts/build_lrs3_manifest.py \
--root data/lrs3 \
--out_dir data/manifests \
--viseme_root data/visemes/lrs3 \
--val_ratio 0.01 \
--test_ratio 0.01 \
--seed 123
Generated files:
data/manifests/lrs3_train.jsonl
data/manifests/lrs3_val.jsonl
data/manifests/lrs3_test.jsonl
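Each manifest is a JSONL file with one record per clip. A plausible entry might look like the following; the field names are illustrative assumptions and are not guaranteed to match the script's actual output:

{"video": "data/lrs3/<talk_id>/00001.mp4", "transcript": "data/lrs3/<talk_id>/00001.txt", "visemes": "data/visemes/lrs3/<talk_id>/00001.npy", "split": "train"}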
python scripts/pretrain_vae.py \
--config configs/pretrain/tinyvae_lrs3.yaml
Outputs:
checkpoints/vae_pretrained.pt
python scripts/train_teacher.py \
--config configs/train/teacher_lrs3.yaml
Outputs:
checkpoints/teacher_best.pt
checkpoints/teacher_last.pt
python scripts/distill_student.py \
--config configs/train/student_distill_lrs3.yaml
Outputs:
checkpoints/student_best.pt
checkpoints/student_last.pt
Place a reference image:
data/examples/ref.jpg
Run:
python scripts/infer_student.py \
--config configs/infer/student_infer.yaml
Outputs:
outputs/samples/<run_name>/frames/
outputs/samples/<run_name>/sample.mp4
Evaluation compares denoised outputs against VAE reconstructions; a minimal sketch of this comparison appears below.
python scripts/evaluate.py \
--config configs/eval/denoise_eval.yaml
Output:
outputs/metrics/eval_report.json
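As a rough illustration, a per-frame PSNR between a denoised output and its VAE reconstruction could be computed as follows; the frame pairing and [0, 1] normalization conventions are assumptions, not the script's exact logic:

import numpy as np

def psnr(denoised, reconstruction, max_val=1.0):
    # Peak signal-to-noise ratio between two frames normalized to [0, max_val].
    mse = np.mean((denoised.astype(np.float64) - reconstruction.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)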
If this work contributes to your research, please cite:
@article{mazumdar2026temposyncdiff,
  title={TempoSyncDiff: Distilled Temporally-Consistent Diffusion for Low-Latency Audio-Driven Talking Head Generation},
  author={Mazumdar, Soumya and Rakesh, Vineet Kumar},
  journal={arXiv preprint arXiv:2603.06057},
  year={2026},
  doi={10.48550/arXiv.2603.06057}
}
This research benefited from institutional support and infrastructure provided by:
The authors also acknowledge the broader research community whose work continues to advance generative modeling.
Talking-head generation systems should be used responsibly.
Users are encouraged to:
obtain consent from any individual whose likeness is animated
clearly disclose synthetically generated content
comply with applicable laws and platform policies
Soumya Mazumdar: reachme@soumyamazumdar.com
Vineet Kumar Rakesh: vineet@vecc.gov.in