Question 1

Which models power it?

Accepted Answer

Dedicated speech-to-text models — `gpt-4o-transcribe` (default, OpenAI flagship), `elevenlabs-scribe-v2` (multilingual, diarization, keyterm biasing), `wizper` (Whisper Large v3 on Fal — fast, multilingual translation), and `fal-speech-to-text` (NVIDIA Canary — cheapest, English-leaning). All read your uploaded audio directly; switch when one struggles with a particular accent or domain.
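As a rough illustration of the trade-offs above, the choice can be written as a small lookup. This is only a sketch: the `pick_model` helper and its flags are hypothetical, and the priority order (diarization first, then translation, then cost) is an assumption rather than something the tool enforces.

```python
def pick_model(need_diarization: bool = False,
               need_translation: bool = False,
               minimize_cost: bool = False) -> str:
    """Map a requirement to one of the four model ids named in the answer.

    The priority order below is an assumption made for illustration.
    """
    if need_diarization:
        return "elevenlabs-scribe-v2"   # diarization, keyterm biasing
    if need_translation:
        return "wizper"                 # Whisper Large v3, multilingual translation
    if minimize_cost:
        return "fal-speech-to-text"     # cheapest, English-leaning
    return "gpt-4o-transcribe"          # the default


print(pick_model(need_diarization=True))
print(pick_model())
```

If one model struggles with a particular accent or domain, rerunning the same audio with a different id is the only change needed.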

Question 2

Is the output court-grade or legally admissible?

Accepted Answer

No. Always review the transcript and have a qualified human transcriber certify it before it goes into legal or compliance archives. This tool helps you draft, not certify.

Question 3

Will it invent dialogue that nobody said?

Accepted Answer

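Dedicated transcription models decode the audio waveform directly, so they are much less prone to hallucinating text than LLM chat models. They are not immune, though: Whisper-family models in particular have been known to emit repeated or spurious phrases over long silences or heavy noise. On muddy audio the usual failure mode is missed words rather than invented sentences, so spot-check quiet or noisy stretches against the recording.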

Question 4

Can it identify speakers by name?

Accepted Answer

Diarization gives best-effort Speaker 1 / Speaker 2 labels (when you enable the toggle and pick an ElevenLabs Scribe model). It cannot identify real names from voice alone — substitute names manually from your attendee list afterward.
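The manual name substitution can be scripted once you know who each label belongs to. A minimal sketch, assuming the transcript uses literal `Speaker N` labels as described above; the `rename_speakers` helper and the sample names are hypothetical.

```python
import re


def rename_speakers(transcript: str, names: dict) -> str:
    """Replace 'Speaker N' labels with names; unmapped labels are left as-is."""
    def swap(match):
        number = int(match.group(1))
        # Fall back to the original label when no name was supplied.
        return names.get(number, match.group(0))

    return re.sub(r"\bSpeaker (\d+)\b", swap, transcript)


draft = "Speaker 1: Shall we start?\nSpeaker 2: Yes.\nSpeaker 3: One moment."
print(rename_speakers(draft, {1: "Dana", 2: "Lee"}))
```

Because the pattern matches the whole number, `Speaker 10` is never mistaken for `Speaker 1`.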

Question 5

How does it handle multilingual recordings?

Accepted Answer

Leave language on auto-detect when speakers genuinely switch languages mid-recording; Wizper and ElevenLabs Scribe handle 99+ languages with reasonable accuracy. When the recording is in a single known language, set it explicitly: that gives the best results and skips the small amount of inference time that auto-detection adds.
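In request terms, the two paths might look like the sketch below. The `language_params` helper is hypothetical, and the `language` field name is an assumption about the underlying request, not a documented setting of this tool.

```python
from typing import Optional


def language_params(choice: Optional[str]) -> dict:
    """Return request options for the language setting.

    An empty dict means auto-detect: no language is sent at all.
    The 'language' key is an assumed field name.
    """
    if choice is None or choice.strip().lower() in ("", "auto"):
        return {}
    return {"language": choice.strip().lower()}


print(language_params("auto"))  # code-switching recording: let the model detect
print(language_params("DE"))    # known single-language recording
```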

Question 6

What do the vocabulary hints do?

Accepted Answer

Enter names, brands, and jargon as a comma-separated list to bias the recognizer toward them. ElevenLabs Scribe v2 reads them as keyterms; OpenAI Transcribe receives them through the `prompt` parameter. Both improve accuracy on tricky proper nouns. Other models silently ignore the hints.
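The per-model routing can be sketched as follows. The helper names are hypothetical; `prompt` is the parameter named above, while the `keyterms` key is an assumed field name for the ElevenLabs request.

```python
def parse_hints(raw: str) -> list:
    """Split a comma-separated hint string, dropping blanks and duplicates."""
    seen = set()
    terms = []
    for part in raw.split(","):
        term = part.strip()
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
    return terms


def hint_params(model: str, raw_hints: str) -> dict:
    """Build the hint-related request options for a given model id."""
    terms = parse_hints(raw_hints)
    if not terms:
        return {}
    if model == "gpt-4o-transcribe":
        return {"prompt": ", ".join(terms)}
    if model == "elevenlabs-scribe-v2":
        return {"keyterms": terms}  # assumed field name
    return {}  # other models ignore the hints


print(hint_params("gpt-4o-transcribe", "Kubernetes, Acme Corp, , kubernetes"))
print(hint_params("wizper", "Kubernetes"))
```

Returning an empty dict for unsupported models mirrors the silent-ignore behaviour: the request still goes through, just without any biasing.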

Question 7

How do I get sharper accuracy?

Accepted Answer

Start with a cleaner audio source (record at a higher bitrate, with one mic per speaker), select the language explicitly when you know it, add vocabulary hints for domain jargon, and use the ElevenLabs path when speaker structure matters more than raw cost. There is no shortcut around bad audio.
