Create
Drive a portrait or character still with real lip-sync — TTS the script via avatar voices or upload audio for tighter phoneme-level mouth motion. Identity-preserving, multilingual.
Script length paces articulation; expression colors the rest.
Give the model a tight line so syllable density paces the mouth-motion timing realistically. Expression bias (neutral warm, smiley upbeat, serious deadpan, or dramatic emphasis) steers eyebrow, cheek, and micro-expression without breaking facial identity. A language hint helps phoneme inference for non-English scripts. The system prompt refuses non-consensual celebrity use and keeps scripts professional. Consent and likeness rights are on you: this is a creative tool, not a forgery pass. Short, clean scripts produce the most credible articulation.
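As a rough illustration of how the script, expression bias, and language hint might be checked before submitting, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `LipsyncRequest` class, the `validate` helper, the expression key strings, and the 12-word soft limit (borrowed from the "12-word teases" use case) are hypothetical and not part of any documented API.

```python
from dataclasses import dataclass
from typing import Optional

# The four expression biases named in the text; these key strings are illustrative.
EXPRESSIONS = {"neutral-warm", "smiley-upbeat", "serious-deadpan", "dramatic-emphasis"}

# Hypothetical soft limit, borrowed from the "12-word teases" use case.
MAX_WORDS = 12


@dataclass
class LipsyncRequest:
    """Hypothetical container for the inputs described above."""
    script: str
    expression: str = "neutral-warm"
    language_hint: Optional[str] = None  # e.g. "es" for a Spanish script


def validate(req: LipsyncRequest) -> list[str]:
    """Return human-readable warnings; an empty list means nothing was flagged."""
    warnings = []
    words = req.script.split()
    if not words:
        warnings.append("script is empty")
    if len(words) > MAX_WORDS:
        warnings.append(
            f"script has {len(words)} words; lines over {MAX_WORDS} are harder to sync"
        )
    if req.expression not in EXPRESSIONS:
        warnings.append(f"unknown expression {req.expression!r}")
    return warnings
```

Calling `validate(LipsyncRequest("Meet the new app today"))` returns an empty list, while a 20-word script with an unrecognized expression returns two warnings. The check only warns; whether to block long scripts outright is a product decision this sketch does not make.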
Five inputs that make or break articulation quality.
When word-shape timing is the differentiator.
12-word teases
Punchy opening lines synced to mouth motion — for Reels, TikTok, and feed-paid tests.
Hero copy delivery
Single launch-day line spoken by founder or mascot, looped across paid social.
Localized hooks
Same brand mascot delivering region-specific lines for international campaigns.
Onboarding flows
Brand character greeting users with their actual welcome line, not a generic loop.
Lipsync moments where motion AND mouth-shape matter.
Phoneme-perfect lipsync is hard; this template targets convincing-enough.
Generic AI lipsync tools promise perfect synchronization and deliver uncanny mouth wobble. This template is honest: actual phoneme-perfect sync depends on the underlying video model, the source image quality, and the script characteristics. Short English lines fare best; long, multilingual, or fast-paced scripts get progressively harder. The system prompt asks for best-effort articulation correlated with syllable density, stable head and eye position, frozen background, and identity continuity with the source. The result is convincing for short-form social — not for film-grade dialogue dubbing.
Habits that compound across short-video production.
Audio-driven mode (when you upload an mp3/wav) gives the tightest phoneme-level sync and is the best path. TTS mode is still real lip-sync (script → speech → mouth motion), but quality varies more by language and script length. Short English lines fare best in either mode.
Upload audio when you have a real recording (talent VO, podcast clip, your own voice memo); that path uses dedicated audio-driven avatar models for the tightest sync. Use TTS when you only have a script: the chosen voice is synthesized and lip-synced automatically.
Do not use non-consensual celebrity likenesses — policy and law apply. The system prompt refuses harassment and impersonation patterns.
Yes for major languages, with quality varying by model. English is strongest today; Spanish, French, and Japanese are reasonable; others may need short test runs. For non-English lines, upload your own recording for best results.
Length is bounded by the underlying avatar model — typically a few seconds. Jobs are async; poll status after submit.
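Since jobs are async, a client typically submits and then polls. The sketch below shows one way such a loop could look. The `wait_for_job` helper, the `fetch_status` callable, and the status strings `"succeeded"` and `"failed"` are all assumptions for illustration; the actual status values and endpoint are not specified here.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it reports a terminal state or the timeout elapses.

    fetch_status is a caller-supplied function (hypothetical) that returns the
    job's current status string. The terminal state names are assumptions.
    """
    waited = 0.0
    while True:
        status = fetch_status()
        if status in ("succeeded", "failed"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job still {status!r} after {waited:.0f}s")
        sleep(interval_s)
        waited += interval_s
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. A fixed interval is the simplest choice; exponential backoff would be a reasonable substitute for longer-running jobs.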
Dedicated audio-driven and TTS-driven avatar models: `heygen-avatar-4` (default, both modes), `multitalk-avatar-tts` (text + ElevenLabs voice), `kling-avatar-v2-pro` (premium audio-driven), and `hunyuan-avatar` (high-fidelity audio-driven). Switch based on whether you have audio and how much fidelity you need.
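One way to encode that switch is a small selection helper. The model identifiers come from the list above, but the `pick_avatar_model` function and its decision rules are an assumed policy, not a documented one. In particular, the text does not say when to prefer `hunyuan-avatar` over `kling-avatar-v2-pro`, so this sketch leaves `hunyuan-avatar` as a manual choice.

```python
def pick_avatar_model(
    has_audio: bool,
    premium: bool = False,
    elevenlabs_voice: bool = False,
) -> str:
    """Pick a model id from the inputs at hand (assumed policy, for illustration)."""
    # Premium audio-driven path: only applies when a recording was uploaded.
    if has_audio and premium:
        return "kling-avatar-v2-pro"
    # Script-only path with an ElevenLabs voice.
    if not has_audio and elevenlabs_voice:
        return "multitalk-avatar-tts"
    # Default model, described above as covering both modes.
    return "heygen-avatar-4"
```

For example, `pick_avatar_model(True, premium=True)` selects `kling-avatar-v2-pro`, and a script-only request with no other flags falls back to `heygen-avatar-4`.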
Yes for content with proper rights and disclosure. Talent agreements should explicitly cover AI-generated lipsync; ad platforms may require AI labeling.
Hook in feed.
Pair with your static brand imagery and rights-cleared talent to get motion variants cheaply for paid social tests. The lipsync is the difference between scrollable and stoppable — when the mouth moves with the message, attention follows.