what WER is, and why everyone uses it
word error rate is, roughly, the fraction of words a model gets wrong against a reference transcript. you align the two, count the insertions, deletions, and substitutions, and divide by the reference length. on conversational audio in 2026 the result usually lands between 5% and 25%; lower is better, and because insertions are counted it can technically exceed 100%. that's the whole metric.
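in code, that alignment is just a word-level edit distance. a minimal sketch (whitespace tokenization; real scoring pipelines also normalize case and punctuation first):

```python
def wer(reference: str, hypothesis: str) -> float:
    """word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words
```

note that the number says nothing about *which* word was dropped — which is the whole argument below.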
WER is genuinely useful inside the lab. it's the metric behind every meaningful jump in speech recognition since the 1990s — DARPA evaluations, Switchboard, LibriSpeech, the Hugging Face Open ASR Leaderboard. when you're comparing two model checkpoints to decide which one to ship, WER is exactly the right thing to look at.
the problem is that WER then escapes the lab and shows up on marketing pages. "99% accurate." "industry-leading WER." "near-human accuracy." these claims aren't lies — they're true and irrelevant.
what WER doesn't capture
a transcript with 5% WER and a transcript with 12% WER can take the same amount of time to clean up. they can also differ by an order of magnitude. the reason is that WER weights every word equally, and your time doesn't.
1. it doesn't know which words matter
a model that misspells "the" three times will register the same WER hit as a model that misspells the surname of the witness you're deposing. one is invisible. the other costs you ten minutes of find-and-replace and a re-read to make sure you got them all. WER thinks they're the same error.
this is why proper-noun handling, custom vocabulary, and per-account terms lists move actual cleanup time more than a two-point WER improvement does. they target the words whose mistakes are expensive.
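a toy illustration, using a made-up deposition sentence and surname: both hypotheses below make exactly one substitution, so they score identically, but only one of the errors costs real editing time. (when the hypothesis has no insertions or deletions, WER reduces to positionwise mismatches over reference length, which is all we need here.)

```python
def substitution_wer(reference: str, hypothesis: str) -> float:
    """WER for the special case of equal-length, substitution-only hypotheses."""
    ref, hyp = reference.split(), hypothesis.split()
    assert len(ref) == len(hyp)
    return sum(r != h for r, h in zip(ref, hyp)) / len(ref)

ref   = "the witness ms kovacevic entered the room"    # hypothetical sentence
hyp_a = "teh witness ms kovacevic entered the room"    # misspells "the": invisible
hyp_b = "the witness ms kovacic entered the room"      # misspells the surname: expensive

print(substitution_wer(ref, hyp_a) == substitution_wer(ref, hyp_b))  # True
```

same score, very different cleanup bill — and nothing in the metric lets you tell them apart.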
2. it ignores speaker attribution
most WER benchmarks score word-level accuracy with the speaker column stripped. that's because evaluating diarization is hard and the standards are fragmented. but for buyers, getting the speaker labels wrong is often the most expensive failure: a transcript with the right words but the wrong speakers attached to them is, depending on your use case, useless or actively dangerous.
a paralegal cleaning up a deposition transcript spends more time fixing speaker labels than fixing words. a qualitative researcher coding interviews has to know which participant said what. a journalist building a quote needs the speaker right or they risk printing a defamatory attribution. WER says nothing about this.
3. it ignores formatting and structure
paragraph breaks. capitalization. punctuation. the "uh" and "um" decisions (verbatim or clean? a real choice with downstream consequences). when a transcript runs together as one wall of text or breaks every sentence into its own paragraph, the WER is the same but the cleanup time is wildly different.
formatting is also where editorial style enters. court reporting wants verbatim with non-verbal cues marked. journalism wants clean prose without "you know"s. academic transcription wants everything, in the right notation. WER is neutral on all of it.
4. it doesn't measure what you do after
a WER number is computed against a reference transcript that someone already produced. but for buyers, the question isn't how the AI compares to a perfect human transcriber — it's how long it takes you to turn the AI output into something you'd actually use.
two tools can have identical WER and still differ by an hour of your time per file. the difference is the editor that hosts the transcript afterwards: how speaker labels propagate, how you verify a quote against audio, how you fix proper nouns at the corpus level instead of one at a time. that's not a model property. that's a workflow property. WER doesn't see it.
what we measure instead
we benchmark cleanup-time-per-audio-hour. it's a single number that captures everything WER ignores: which words got fixed, how long the speaker labels took to repair, how long it took to verify quotations against audio, how the editor helped or didn't.
the methodology, in five lines:
- fixed corpus. six audio files representing the six jobs-to-be-done that drive most paid transcription. published, reproducible.
- fixed cleaning protocol. speaker-label fixes counted per turn. proper-noun fixes counted. paragraph-break corrections counted. quote verification on a random sample of 20 quotes per file.
- blind cleaners. the same humans clean every transcript without being told which tool produced which file.
- same humans across tools. removes the per-cleaner skill variance.
- ratio reported. cleanup minutes per audio minute, expressed as a percentage. target for our tool: under 5%. Temi's reported median across users: 25–40%.
the full methodology, the corpus, the reference transcripts, the cleaning harness, and the per-tool delivered transcripts all ship with the benchmark page on launch. anyone can reproduce or extend it.
what to ask vendors instead of "what's your WER"
- show me a 30-minute file processed end-to-end, including the time to fix speaker labels. time it, in seconds. that's the number.
- what happens when the model misspells a proper noun on the first occurrence? do I have to fix it everywhere it appears, or once? the answer separates a workspace from a Word document.
- can I click a word and hear the second of audio it came from? if the answer is "scrub the timeline in a separate program," verification will eat your day.
- what's the export format for <your specific tool>? NVivo, ATLAS.ti, deposition format, Jefferson notation, Premiere SRT — the format you actually use is more interesting than ".docx export."
- what does private-mode mean in your product, exactly? "encrypted in transit" is not the same as "audio never leaves my device." vendors who blur this are blurring it on purpose.
why this matters for the category
the AI transcription category has been competing on WER for a decade. that's a model competition. it's roughly over — Whisper-large-v3 and its descendants set a floor in 2024 that most commercial vendors now meet. WER differences between serious tools are now in the noise.
the actual remaining variance — the variance buyers feel — lives in the editor, the export formats, the speaker handling, the privacy posture, and the pricing model. the next decade of transcription will be a workflow competition, not a model competition. the marketing pages haven't caught up yet.
if you're shopping for a transcription tool, ignore the WER claim. ask for a 30-minute file processed end-to-end, watch how long the cleanup takes, and pick the one that finishes first.