FacEDiT: Unified Talking Face Editing and Generation via Facial Motion Infilling

POSTECH · KAIST · UT Austin
(Teaser figure: overview of FacEDiT)

FacEDiT is a speech-conditioned Diffusion Transformer that learns to infill masked facial motion. This unified formulation supports both talking face editing, where local motion is revised to match the edited speech while unedited regions are preserved, and from-scratch talking face generation, which synthesizes the full facial motion sequence from speech alone.
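As a rough illustration of what motion-infilling training might look like, here is a minimal PyTorch sketch: random spans of the motion sequence are noised, and the model, conditioned on frame-aligned speech features and the surrounding clean frames, learns to predict the noise. The module names, feature dimensions, toy noise schedule, and plain Transformer backbone are all assumptions for illustration, not the paper's architecture.

```python
# Minimal sketch of speech-conditioned motion infilling, assuming a plain
# PyTorch Transformer in place of the paper's Diffusion Transformer. Module
# names, dims, and the toy linear noise schedule are illustrative assumptions.
import torch
import torch.nn as nn

class MotionInfiller(nn.Module):
    def __init__(self, motion_dim=64, audio_dim=128, d_model=256, n_layers=4):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.mask_emb = nn.Embedding(2, d_model)    # marks frames to be infilled
        self.time_proj = nn.Linear(1, d_model)      # crude diffusion-step embedding
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)  # predicts noise on masked frames

    def forward(self, motion_in, audio_feats, mask, t):
        # motion_in: (B, T, D) motion, noised on masked frames, clean elsewhere
        # audio_feats: (B, T, A) frame-aligned speech features; mask: (B, T) bool
        x = self.motion_proj(motion_in) + self.audio_proj(audio_feats)
        x = x + self.mask_emb(mask.long()) + self.time_proj(t[:, None, None])
        return self.head(self.backbone(x))          # positional encodings omitted

def training_step(model, motion, audio, mask):
    # motion: (B, T, D) clean facial motion; mask: True where motion is hidden
    t = torch.rand(motion.size(0))                  # diffusion timestep in (0, 1)
    alpha = (1.0 - t)[:, None, None]                # toy linear noise schedule
    noise = torch.randn_like(motion)
    noisy = alpha.sqrt() * motion + (1 - alpha).sqrt() * noise
    m = mask[:, :, None].float()
    model_in = m * noisy + (1 - m) * motion         # unmasked frames stay clean context
    pred = model(model_in, audio, mask, t)
    return ((pred - noise) ** 2 * m).sum() / m.sum().clamp(min=1)
```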

Editing (V1): Local Editing + Stitching

In this setting, only the edited segment is regenerated and stitched back into the original video. Existing methods introduce visible boundary artifacts at the stitch points; FacEDiT produces smooth, seamless transitions.
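To make the contrast concrete, here is a hedged sketch of the two strategies. `generate_segment` (a segment-level generator) and `sample` (a diffusion sampling loop for the infiller sketched above) are hypothetical stand-ins, not APIs from the paper; the difference is only in what context the model sees around the edited frames [lo, hi).

```python
# Hedged sketch contrasting stitching with infilling; `generate_segment`
# and `sample` are hypothetical callables, not the paper's API.
import torch

def edit_by_stitching(generate_segment, motion, audio, lo, hi):
    # Baseline: regenerate the edited span in isolation, then splice it back.
    # The new segment never attends to the surrounding frames, so the joins
    # at lo and hi can jump -- the boundary artifacts noted above.
    out = motion.clone()
    out[:, lo:hi] = generate_segment(audio[:, lo:hi])
    return out

def edit_by_infilling(sample, model, motion, audio, lo, hi):
    # Infilling: mask only the edited span and denoise it while conditioning
    # on the untouched frames on both sides, giving seamless transitions.
    mask = torch.zeros(motion.shape[:2], dtype=torch.bool)
    mask[:, lo:hi] = True
    return sample(model, motion, audio, mask)
```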

Editing (V2): Full Sequence Generation

In this setting, the entire video is regenerated from the edited speech. Competing models alter even the unedited regions, while FacEDiT preserves the untouched facial motion.
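One illustrative way to quantify this preservation claim, assuming access to the original and edited motion tensors plus the edit mask, is a masked drift metric over the unedited frames. This is a sketch for intuition, not the paper's evaluation protocol.

```python
# Mean squared motion change over frames that were never edited; an
# illustrative check, assuming tensors orig/edited: (B, T, D), mask: (B, T).
def unedited_drift(orig, edited, mask):
    keep = (~mask)[:, :, None].float()
    return ((orig - edited) ** 2 * keep).sum() / keep.sum().clamp(min=1)
```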

From-scratch Talking Face Generation

All methods generate full facial motion solely from speech input.
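In the unified infilling view, from-scratch generation is simply the fully-masked case: no clean context frames remain, so the motion is driven entirely by speech. A minimal sketch, again using the hypothetical `sample` loop from the editing sketch above:

```python
# From-scratch generation as fully-masked infilling; `sample` is the same
# hypothetical diffusion sampling loop assumed in the editing sketch.
import torch

def generate_from_speech(sample, model, audio, motion_dim=64):
    B, T = audio.shape[:2]
    mask = torch.ones(B, T, dtype=torch.bool)   # mask every frame
    init = torch.zeros(B, T, motion_dim)        # no context motion to preserve
    return sample(model, init, audio, mask)
```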