Existing methods introduce visible boundary artifacts where edited segments are stitched back into the video, whereas FacEDiT produces smooth, seamless transitions.
Competing models alter unedited regions because they regenerate the entire video, whereas FacEDiT preserves facial motion outside the edited region untouched.
All methods generate full facial motion solely from speech input.