why this test exists
word-level transcription accuracy across the commercial category is now in the noise band. WER (word error rate) differences between serious tools are smaller than the inter-cleaner variance when the same humans clean the same files. the model competition for words is roughly over.
speaker diarization — figuring out which speaker said which words — is a different story. it's a separate model running on the same audio, it relies on different signal (voice embeddings, acoustic separation, conversational structure), and it's where the "AI transcription" experience actually breaks. when a transcript names the wrong speaker, the buyer's whole workflow gets the wrong shape: the paralegal mis-files a quote against the witness, the journalist attributes a statement to the wrong source, the qualitative researcher codes the participant's contribution against the interviewer's column.
we've been benchmarking transcription tools for a while. the word-level numbers converge. the diarization numbers don't. so we ran adversarial tests.
the test corpus
five files designed to break diarization in five different ways. each file is 6–8 minutes long, scripted, with a ground-truth speaker-attribution timeline (the timeline format is sketched after this list). all human-recorded; because the sessions are scripted, we know exactly who's supposed to be talking when.
- file A — clean two-speaker baseline. two distinct voices (male, female), studio-quality audio, no crosstalk, ~2 second pauses between turns. this is the easy case. every tool should get close to 100%.
- file B — heavy crosstalk. two speakers interrupting each other, with overlapping speech regions averaging 600ms per overlap. realistic for interview-style conversation, depositions with objections, podcasts with energetic guests.
- file C — similar voices. two female speakers, similar age, similar regional accent. the audio is clean; the challenge is purely acoustic similarity. this is the case that breaks diarization most often in qualitative-research interviews.
- file D — phone-quality audio. same two speakers as file A, but recorded over a phone line (8kHz, narrowband codec, occasional packet loss). this is what most journalism interviews and many depositions actually sound like.
- file E — speaker swap. a four-speaker focus group where one speaker leaves mid-recording and a fifth speaker takes their seat. the tool has to figure out that "speaker 3 in the first half" and "speaker 3 in the second half" are different people. adversarial; rarely happens in practice; tells you what the model assumes.
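the ground-truth timeline for each file is conceptually just a list of timed speaker spans. a minimal sketch of the shape, in python; the field names and times are illustrative, not our exact schema:

```python
from dataclasses import dataclass

@dataclass
class Span:
    start: float   # seconds from start of file
    end: float     # seconds from start of file
    speaker: str   # ground-truth speaker label

# illustrative excerpt in the style of file E: the label "S3" belongs
# to two different people before and after the seat swap at ~4:08
timeline = [
    Span(0.0, 4.2, "S1"),
    Span(4.8, 9.1, "S2"),
    # ...
    Span(244.0, 248.0, "S3"),    # original speaker 3
    Span(249.5, 254.2, "S3b"),   # replacement in the same seat
]
```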
the score
for each file and each tool, we measured speaker-attribution accuracy at the word level: what percentage of words got attributed to the correct speaker. we used the standard diarization error rate (DER) methodology, but reported it as accuracy (1 - DER) because a non-technical reader can interpret that directly.
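the score is straightforward to reproduce with the same open-source tooling as our baseline. a minimal sketch using pyannote.metrics; the segments here are made up, and our real pipeline maps each transcript word to a time span first:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# ground truth: who was actually speaking, and when
reference = Annotation()
reference[Segment(0.0, 4.2)] = "S1"
reference[Segment(4.8, 9.1)] = "S2"

# tool output: the delivered transcript's speaker spans
hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "A"
hypothesis[Segment(4.5, 9.1)] = "B"

# DER finds the best mapping between label sets ("S1" vs "A") before
# scoring, so each tool's arbitrary speaker names don't matter
metric = DiarizationErrorRate()
der = metric(reference, hypothesis)
print(f"speaker-attribution accuracy: {1 - der:.1%}")
```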
we also counted "speaker confusion events" — the number of turns where the tool flipped speakers across an entire monologue. these are the failures that compound into paralegal-time during cleanup, because once a tool flips speakers at the start of a turn, every word until the next turn boundary inherits the wrong label.
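the counting rule, as a minimal sketch; it assumes words have already been aligned to ground-truth turns and the tool's labels mapped onto ground-truth names (the data shapes are illustrative):

```python
def count_confusion_events(turns):
    """count turns where the tool mislabeled the entire monologue.

    `turns` is a list of (true_speaker, word_labels) pairs, where
    word_labels are the tool's per-word speaker attributions for
    that turn, already mapped onto ground-truth speaker names.
    """
    events = 0
    for true_speaker, word_labels in turns:
        # a confusion event means every word in the turn went to the
        # wrong speaker, not just a ragged boundary at the edges
        if word_labels and all(w != true_speaker for w in word_labels):
            events += 1
    return events

# the second turn is a full flip; the first is only a boundary error
turns = [
    ("S1", ["S1", "S1", "S2"]),        # ragged edge: not an event
    ("S2", ["S1", "S1", "S1", "S1"]),  # whole turn flipped: 1 event
]
assert count_confusion_events(turns) == 1
```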
results: file A (baseline)
everyone passes the easy case, mostly:
- audiohighlight: 99.2% accuracy, 0 confusion events
- rev AI: 98.8%, 0 confusion events
- sonix: 98.4%, 1 confusion event
- temi: 96.1%, 2 confusion events
- otter: 97.3%, 1 confusion event
- whisper-large-v3 + pyannote (open source): 98.7%, 0 confusion events
the baseline test is a sanity check. tools that fail it have real problems; tools that pass it are minimally competent. the spread is 96–99%, which sounds tight, but on a 60-minute interview (call it ~750 speaker turns, one every five seconds or so) it's the difference between roughly 30 wrong attributions and 5.
results: file B (heavy crosstalk)
this is where the spread opens up:
- audiohighlight: 87.4%, 2 confusion events
- rev AI: 81.2%, 4 confusion events
- sonix: 79.6%, 5 confusion events
- temi: 68.4%, 11 confusion events
- otter: 74.1%, 8 confusion events
- open-source baseline: 82.7%, 4 confusion events
temi falls off a cliff on crosstalk audio. for interviews with energetic exchange — typical journalism, typical podcasts with strong guests, typical depositions with objections — temi gets one in three speaker labels wrong. the cleanup tax that shows up in user reviews is exactly this.
results: file C (similar voices)
this is the test that broke the most tools:
- audiohighlight: 78.2%, 6 confusion events
- rev AI: 71.4%, 9 confusion events
- sonix: 69.3%, 11 confusion events
- temi: 54.7%, 18 confusion events
- otter: 62.1%, 13 confusion events
- open-source baseline: 73.6%, 8 confusion events
two speakers with similar voices is the hardest realistic test for current diarization. on this file every tool's accuracy degrades; temi's drops below 55%, which is barely better than coin-flipping who said what. for a researcher interviewing two female colleagues with similar accents — a common case — the cleanup pass on temi is essentially re-labeling the entire transcript by hand.
results: file D (phone audio)
phone-quality audio reduces the acoustic information available to the diarization model:
- audiohighlight: 91.8%, 3 confusion events
- rev AI: 88.2%, 4 confusion events
- sonix: 85.4%, 6 confusion events
- temi: 79.8%, 8 confusion events
- otter: 83.6%, 6 confusion events
- open-source baseline: 86.9%, 5 confusion events
phone audio shifts every tool down by roughly 7–16 points and widens the spread, but the clean-audio ranking holds. for journalism interviews recorded over phone lines, every commercial tool delivers usable diarization; temi requires substantially more cleanup.
results: file E (speaker swap)
the adversarial test:
- audiohighlight: 88.7%, swap detected at 4:12 (ground truth: 4:08)
- rev AI: 71.4%, swap merged into existing speaker
- sonix: 67.2%, swap merged
- temi: 51.3%, no swap detection
- otter: 62.8%, partial detection
- open-source baseline: 76.4%, swap detected at 4:24
the speaker-swap test is rare in practice. but the failure mode reveals something about the underlying assumption: most diarization models assume a fixed set of speakers throughout the file. when that assumption breaks, the model attributes new audio to the wrong existing speaker rather than detecting a new one. for focus groups with rotating participants, this is a real problem.
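for intuition, this is the kind of check a swap-aware pipeline would need: compare each segment's voice embedding against the running centroid for its assigned label, and flag when the distance jumps. a rough sketch, not any shipping tool's method; `embed()` stands in for whatever speaker-embedding model you'd use, and the threshold is a made-up starting point:

```python
import numpy as np

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def flag_possible_swaps(segments, embed, threshold=0.45):
    """yield segments whose voice drifts away from their label's history.

    `segments` is a list of (label, audio) pairs in file order;
    `embed(audio)` returns a fixed-size voice embedding (hypothetical
    helper standing in for a real speaker-embedding model).
    """
    centroids = {}  # label -> (running mean embedding, count)
    for i, (label, audio) in enumerate(segments):
        vec = embed(audio)
        if label in centroids:
            mean, n = centroids[label]
            if cosine_distance(vec, mean) > threshold:
                # same label, very different voice: possible seat swap
                yield i, label
            centroids[label] = ((mean * n + vec) / (n + 1), n + 1)
        else:
            centroids[label] = (vec, 1)
```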
why audiohighlight wins this test
two reasons, neither magical. one, we use a diarization model trained on a corpus weighted toward conversational audio with crosstalk and similar voices, rather than the standard clean-conversation training data. two, the tool's editor surfaces speaker-attribution uncertainty to the cleaning user — every word gets a confidence score, and rows where the diarization model is uncertain appear in a "needs eyes" panel for fast review.
the editor doesn't make the model better. it makes the model's failures visible. that's the point. for paralegals and researchers reviewing transcripts, knowing where the model is uncertain is often more valuable than knowing the aggregate accuracy number.
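the flagging rule itself is simple. a minimal sketch; the field names and the 0.7 cutoff are illustrative stand-ins, not our production values:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    speaker: str
    speaker_confidence: float  # diarization model's confidence, 0..1

def needs_eyes(words, cutoff=0.7):
    """collect the words a cleaner should review first.

    anything the diarization model attributed with confidence below
    the cutoff lands in the review panel instead of being trusted.
    """
    return [w for w in words if w.speaker_confidence < cutoff]

words = [
    Word("so", "S1", 0.98),
    Word("anyway", "S1", 0.52),  # crosstalk region: flagged
]
for w in needs_eyes(words):
    print(f"review: {w.text!r} attributed to {w.speaker}")
```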
what this means for buyers
if your audio looks like file A — clean studio recording, two distinct voices, no crosstalk — every commercial tool works. pick on price and editor.
if your audio looks like file B (crosstalk), file C (similar voices), or file D (phone quality) — which covers most interview audio as it's actually recorded — the diarization differences matter. on a 60-minute interview, the gap between 90% and 70% diarization accuracy is the difference between five minutes of cleanup and forty.
if your audio is file E (speaker swap, focus group with rotating participants) — the safe assumption is that no tool handles it cleanly. plan for manual cleanup, choose a tool with bulk-relabel, and concentrate your review around the points where participants change.
methodology, limitations
all tests were run as of the publish date. tool versions: temi (current web app; no version visible to users), rev AI (April 2026), sonix (premium tier, April 2026), otter (Pro plan, April 2026), whisper-large-v3 + pyannote 3.1 (current open-source releases). all files went through each tool's standard ingest path. no per-tool tuning, no custom vocabulary, no pre-cleaning of the audio.
full corpus, ground-truth attributions, and per-tool delivered transcripts ship with the public benchmark at launch. if you think we mismeasured a tool, the protocol is: send your audio file, a ground-truth attribution spreadsheet, the cleaning protocol you used, and your numbers. we re-run, and if our number is wrong we publish the correction.