
WER is a useless buyer metric.

word error rate is the headline number every transcription vendor markets against. it tells you nothing about whether the transcript will save you time.

what WER is, and why everyone uses it

word error rate is the percentage of words a model gets wrong against a reference transcript. you align the two, count the insertions, deletions, and substitutions, divide by the reference length, and you get a number — usually between 5% and 25% on conversational audio in 2026. lower is better. that's the whole metric.
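to make that concrete, here's a minimal sketch of the computation in python: a word-level levenshtein distance, assuming lowercased words and simple tokenization (real benchmarks use stricter normalization rules).

import re

def wer(reference: str, hypothesis: str) -> float:
    # split into lowercase words; real evaluations normalize punctuation too
    ref = re.findall(r"[a-z']+", reference.lower())
    hyp = re.findall(r"[a-z']+", hypothesis.lower())
    # word-level levenshtein: substitutions, insertions, deletions
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every remaining reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every remaining hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            same = ref[i - 1] == hyp[j - 1]
            d[i][j] = min(d[i - 1][j - 1] + (0 if same else 1),  # substitution
                          d[i - 1][j] + 1,                       # deletion
                          d[i][j - 1] + 1)                       # insertion
    return d[-1][-1] / len(ref)  # errors divided by reference length

print(wer("the witness mr kowalski arrived", "the witness mr kowalsky arrived"))  # 0.2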

WER is genuinely useful inside the lab. it's the metric behind every meaningful jump in speech recognition since the 1990s — DARPA evaluations, Switchboard, LibriSpeech, the Hugging Face Open ASR Leaderboard. when you're comparing two model checkpoints to decide which one to ship, WER is exactly the right thing to look at.

the problem is that WER then escapes the lab and shows up on marketing pages. "99% accurate." "industry-leading WER." "near-human accuracy." these claims aren't lies — they're true and irrelevant.

what WER doesn't capture

a transcript with 5% WER and a transcript with 12% WER can take the same amount of time to clean up. they can also differ by an order of magnitude. the reason is that WER weights every word equally, and your time doesn't.

1. it doesn't know which words matter

a model that misspells "the" three times will register the same WER hit as a model that misspells the surname of the witness you're deposing. one is invisible. the other costs you ten minutes of find-and-replace and a re-read to make sure you got them all. WER thinks they're the same error.

this is why proper-noun handling, custom vocabulary, and per-account terms lists move actual cleanup time more than a two-point WER improvement does. they target the words whose mistakes are expensive.
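a hypothetical sketch of that idea: score the alignment against a terms list, so a botched surname costs more than a botched "the" (the term list and weight here are invented for illustration).

# hypothetical: weight alignment errors by a per-account terms list
CRITICAL_TERMS = {"kowalski", "meperidine"}  # invented examples

def weighted_error_cost(aligned_pairs, critical_weight=10.0):
    # aligned_pairs: (ref_word, hyp_word) tuples from a WER alignment,
    # with None on one side marking an insertion or deletion
    cost = 0.0
    for ref, hyp in aligned_pairs:
        if ref == hyp:
            continue
        if ref in CRITICAL_TERMS or hyp in CRITICAL_TERMS:
            cost += critical_weight  # expensive: find-and-replace plus a re-read
        else:
            cost += 1.0              # cheap: invisible on a skim
    return cost

# three wrong "the"s cost 3.0; one wrong surname costs 10.0,
# even though plain WER counts each as one error.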

2. it ignores speaker attribution

most WER benchmarks score word-level accuracy with the speaker column stripped. that's because evaluating diarization is hard and the standards are fragmented. but for buyers, getting the speaker labels wrong is often the most expensive failure: a transcript with the right words but the wrong speakers attached to them is, depending on your use case, useless or actively dangerous.

a paralegal cleaning up a deposition transcript spends more time fixing speaker labels than fixing words. a qualitative researcher coding interviews has to know which participant said what. a journalist building a quote needs the speaker right or they risk printing a defamatory attribution. WER says nothing about this.
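here's a sketch of the failure WER can't see, assuming the two transcripts are already word-aligned and equal length — which dodges the genuinely hard part of diarization scoring.

def speaker_error_rate(ref, hyp):
    # ref, hyp: aligned lists of (speaker, word) pairs of equal length.
    # counts words transcribed correctly but attributed to the wrong
    # speaker: invisible to WER, expensive to fix.
    assert len(ref) == len(hyp)
    misattributed = sum(
        1 for (rs, rw), (hs, hw) in zip(ref, hyp)
        if rw == hw and rs != hs
    )
    return misattributed / len(ref)

ref = [("A", "i"), ("A", "object"), ("B", "overruled")]
hyp = [("A", "i"), ("B", "object"), ("B", "overruled")]
print(speaker_error_rate(ref, hyp))  # 0.33..., while WER on the words alone is 0.0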

3. it ignores formatting and structure

paragraph breaks. capitalization. punctuation. the "uh" and "um" decisions (verbatim or clean? a real choice with downstream consequences). when a transcript runs together as one wall of text or breaks every sentence into its own paragraph, the WER is the same but the cleanup time is wildly different.

formatting is also where editorial style enters. court reporting wants verbatim with non-verbal cues marked. journalism wants clean prose without "you know"s. academic transcription wants everything, in the right notation. WER is neutral on all of it.
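the verbatim-or-clean choice alone moves the score. a sketch, with an invented filler list:

import re

FILLERS = {"uh", "um", "er"}  # invented; real style guides differ

def normalize(text: str, style: str) -> list[str]:
    words = re.findall(r"[a-z']+", text.lower())
    if style == "clean":
        words = [w for w in words if w not in FILLERS]
    return words

ref = "so, uh, we shipped it on, um, tuesday"
hyp = "so we shipped it on tuesday"
# against the verbatim reference the hypothesis has two deletions: WER 2/8 = 0.25.
# against the clean reference it matches exactly: WER 0.0. same audio, same output.
print(normalize(ref, "verbatim"))
print(normalize(ref, "clean"))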

4. it doesn't measure what you do after

a WER number is computed against a reference transcript that someone already produced. but for buyers, the question isn't how the AI compares to a perfect human transcriber — it's how long it takes you to turn the AI output into something you'd actually use.

two tools can have identical WER and still differ by an hour of your time per file. the difference is the editor that hosts the transcript afterwards: how speaker labels propagate, how you verify a quote against audio, how you fix proper nouns at the corpus level instead of one at a time. that's not a model property. that's a workflow property. WER doesn't see it.
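for example, the corpus-level proper-noun fix is a one-pass operation once the workflow supports it. a sketch, with an invented terms map and file layout:

import re
from pathlib import Path

# hypothetical per-account corrections: misrecognition -> canonical spelling
TERMS = {"kowalsky": "Kowalski", "cole walski": "Kowalski"}

def fix_corpus(transcript_dir: str) -> None:
    # apply every correction across the whole corpus in one pass,
    # instead of find-and-replace in each file by hand
    for path in Path(transcript_dir).glob("*.txt"):
        text = path.read_text()
        for wrong, right in TERMS.items():
            text = re.sub(re.escape(wrong), right, text, flags=re.IGNORECASE)
        path.write_text(text)

fix_corpus("transcripts/")  # hypothetical directory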

what we measure instead

we benchmark cleanup-time-per-audio-hour. it's a single number that captures everything WER ignores: which words got fixed, how long the speaker labels took to fix, how long it took to verify quotations against audio, how much the editor helped or didn't.

the methodology, in five lines:

1. assemble a corpus of real conversational audio with reference transcripts.
2. run every tool on the same files.
3. have cleaners edit each tool's delivered transcript up to the reference standard, inside a timed cleaning harness.
4. log the editing time per file.
5. report minutes of cleanup per hour of audio, per tool.

the full methodology, the corpus, the reference transcripts, the cleaning harness, and the per-tool delivered transcripts all ship with the benchmark page on launch. anyone can reproduce or extend it.
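the number itself is just a ratio. a sketch of the arithmetic, with invented figures:

def cleanup_minutes_per_audio_hour(files):
    # files: (edit_minutes, audio_minutes) pairs across the corpus
    edit = sum(e for e, _ in files)
    audio_hours = sum(a for _, a in files) / 60
    return edit / audio_hours

# a 30-minute file that took 12 minutes to clean, plus a 60-minute
# file that took 45: 57 edit minutes over 1.5 audio hours = 38.0
print(cleanup_minutes_per_audio_hour([(12, 30), (45, 60)]))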

what to ask vendors instead of "what's your WER"

1. can i load a per-account terms list, and does it apply across my whole corpus?
2. how do speaker labels get assigned, and how do i fix a misattributed speaker everywhere at once?
3. is the output verbatim or clean by default, and can i choose?
4. how does the editor help me verify a quote against the audio?
5. will you run a 30-minute file of mine end-to-end so i can time the cleanup myself?

why this matters for the category

the AI transcription category has been competing on WER for a decade. that's a model competition. it's roughly over — Whisper-large-v3 and its descendants set a floor in 2024 that most commercial vendors now meet. WER differences between serious tools are now in the noise.

the actual remaining variance — the variance buyers feel — lives in the editor, the export formats, the speaker handling, the privacy posture, and the pricing model. the next decade of transcription will be a workflow competition, not a model competition. the marketing pages haven't caught up yet.

if you're shopping for a transcription tool, ignore the WER claim. ask for a 30-minute file processed end-to-end, watch how long the cleanup takes, and pick the one that finishes first.
