the invisible bill
when you pay for transcription, you pay a per-minute rate for the audio. $0.25 per minute on temi, $0.25 on rev AI, $1.50 on rev human, $0.17 on sonix premium with a $22/month subscription. these are the prices you see.
there's a second bill nobody sends you — the time you spend turning the delivered transcript into something you'd publish, file, cite, code, or hand to a producer. we call this the cleanup tax. it's invisible because it's billed in your hours, not the vendor's invoice. for most buyers it's also the larger of the two costs.
a 30-minute interview costs $7.50 on temi at the per-minute rate. the cleanup, on the same file, takes the median temi user between 7 and 12 minutes — call it 9 minutes. at a freelance journalist's billable rate of $80/hour, that's $12 in time. the invisible tax is bigger than the visible bill, on the same file, every time.
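the two bills can be put side by side in a few lines. a minimal sketch, using the rates quoted above (`file_cost` is a hypothetical helper for illustration, not any vendor's API):

```python
def file_cost(audio_min, per_min_rate, cleanup_ratio, hourly_rate):
    """Split one transcription job into its visible and invisible costs."""
    invoice = audio_min * per_min_rate            # the bill the vendor sends
    cleanup_min = audio_min * cleanup_ratio       # time you spend editing
    time_cost = cleanup_min / 60 * hourly_rate    # that time at your rate
    return invoice, time_cost

# 30-minute interview, $0.25/min, 9 minutes of cleanup (ratio 0.3), $80/hour
invoice, time_cost = file_cost(30, 0.25, 0.3, 80)
print(invoice, time_cost)  # 7.5 12.0
```

swap in your own rate and ratio; the section at the end shows how to measure the ratio.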
where the time goes
we ran six benchmark files through the major AI transcription tools (audiohighlight, temi, rev AI, sonix, otter) plus three open-source baselines, and timed the cleanup. four buckets consistently ate the time:
1. speaker label correction
diarization accuracy is the weakest part of every commercial AI transcription product in 2026. on a two-speaker interview with clean audio, the model gets it right roughly 90% of the time. on a three-speaker deposition with crosstalk, that drops to 70%. every wrong attribution requires a manual fix.
most tools make the manual fix one row at a time: find the row, click the speaker name, type the correction. on a file with 80 speaker turns and 20% wrong, that's 16 fixes, maybe five minutes of work. on a three-speaker deposition with 200 turns and 30% wrong, it's 60 fixes, closer to twenty minutes.
the cost scales linearly with the number of speaker turns. a tool with bulk-fix (relabel "Speaker 1" once and propagate the change to every Speaker 1 row) collapses it from minutes to seconds.
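a bulk fix is just a propagated relabel. a minimal sketch, assuming transcript rows are dicts with a `speaker` field (the row shape and names here are hypothetical):

```python
def bulk_relabel(rows, old_label, new_label):
    """Propagate one speaker correction to every row carrying that label."""
    return [
        {**row, "speaker": new_label} if row["speaker"] == old_label else row
        for row in rows
    ]

rows = [
    {"speaker": "Speaker 1", "text": "counsel, please proceed."},
    {"speaker": "Speaker 2", "text": "thank you, your honor."},
    {"speaker": "Speaker 1", "text": "state your name for the record."},
]
fixed = bulk_relabel(rows, "Speaker 1", "Judge Alvarez")
print(sum(r["speaker"] == "Judge Alvarez" for r in fixed))  # 2
```

one correction, every matching row: the work no longer grows with turn count.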
2. proper-noun and technical-vocabulary correction
AI transcription models guess at unfamiliar words. they substitute the closest phonetic neighbor in their training data. the famous "Bayesian" → "Beijing" substitution is the archetype. real-world failures we measured:
- "habeas corpus" → "habeas corps" (legal interview)
- researcher's surname → wrong by one letter, every time it appeared (academic interview)
- company brand "Tigris" → "tigers" (founder interview)
- study acronym "NHANES" → "n hands" (qualitative research)
each wrong substitution is fast to fix once you spot it. the cost is the spotting. you have to read the whole transcript carefully because the model is confidently producing plausible-looking nonsense. on a 30-minute interview with a dense vocabulary, this is 5–8 minutes of careful re-reading on top of any other cleanup.
per-account custom vocabulary — feeding the model a list of terms it should recognize — collapses this. so does per-corpus learning, where corrections you make on file one propagate to file two.
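the propagation itself is simple; the hard part is collecting good correction pairs. a rough sketch of reapplying learned corrections with whole-word, case-insensitive replacement (the pairs shown are the failures listed above; the function is illustrative, not any tool's API):

```python
import re

def apply_vocabulary(text, corrections):
    """Replace known mis-transcriptions with the corrected term,
    matching whole words case-insensitively."""
    for wrong, right in corrections.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text,
                      flags=re.IGNORECASE)
    return text

# corrections learned on file one, reapplied on file two
corrections = {"n hands": "NHANES", "habeas corps": "habeas corpus"}
print(apply_vocabulary("the n hands cohort", corrections))  # the NHANES cohort
```

the whole-word boundary matters: "habeas corps" must not fire inside an already-correct "habeas corpus".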
3. paragraph-break and formatting cleanup
AI transcripts arrive with one of two failure modes: a wall of text with no paragraph breaks (tools that don't model discourse structure), or every sentence in its own paragraph (tools that break too aggressively). neither is publishable.
fixing this is the most subjective task in the cleanup workflow. you're making editorial judgment calls about where a thought ends. on a 30-minute interview that's a few minutes; on a 60-minute deposition with 25-line page requirements, it can be 15.
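one structural signal a tool can use instead of guesswork is silence: a long pause between sentences usually marks the end of a thought. a crude sketch, assuming sentence-level timestamps are available (the 1.5-second threshold is an illustrative guess, not a measured value):

```python
def paragraph_breaks(sentences, gap_threshold=1.5):
    """Start a new paragraph wherever the pause before a sentence
    exceeds the threshold (seconds) -- one simple structural heuristic."""
    paragraphs, current = [], [sentences[0]["text"]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cur["start"] - prev["end"] > gap_threshold:
            paragraphs.append(" ".join(current))
            current = []
        current.append(cur["text"])
    paragraphs.append(" ".join(current))
    return paragraphs

sentences = [
    {"text": "we moved to austin in 2019.", "start": 0.0, "end": 2.1},
    {"text": "it was a big change.", "start": 2.3, "end": 3.8},
    {"text": "anyway, the second study.", "start": 6.0, "end": 7.5},
]
print(len(paragraph_breaks(sentences)))  # 2
```

pauses alone won't match a careful editor, but they beat both failure modes above: neither a wall of text nor one sentence per paragraph.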
4. quote verification
the most expensive bucket, and the one tools rarely acknowledge. for any transcript you'll cite, quote, file in court, or publish from, you have to verify the words are actually what was said. AI models are confidently wrong often enough that you can't skip this.
verification means going back to the audio. on most tools this means: open the audio file in a separate program, scrub the timeline to roughly the right place, listen, agree or disagree, type the correction. for a 30-minute file with (say) 20 quotes you care about, that's 8–10 minutes of context-switching.
a tool with click-word-to-replay-audio collapses this from minutes to seconds per quote. the editor knows where each word is in the audio; you click the word, the audio jumps, you decide. quote verification stops being a tax.
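click-to-replay needs nothing more than the word-level timestamps the model already produces. a sketch of the lookup (the word-record shape and the half-second pre-roll are assumptions, not a description of any specific editor):

```python
def seek_time(words, index, pad=0.5):
    """Given word-level timestamps, return where to start playback
    so a clicked word can be heard in context."""
    start = words[index]["start"]
    return max(0.0, start - pad)   # back up slightly, never before 0

words = [
    {"text": "habeas", "start": 62.10, "end": 62.55},
    {"text": "corpus", "start": 62.55, "end": 63.02},
]
print(seek_time(words, 1))  # 62.05
```

the editor does this on every click: word in, playback position out, no separate audio program and no scrubbing.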
the math
on the benchmark corpus, average cleanup times across the tested tools, normalized to a 30-minute audio file:
- temi: 8–12 minutes of cleanup on the median file. higher on technical content. 25–40% of audio length, consistent with user-reported numbers.
- rev AI: 5–9 minutes. better proper-noun handling, similar diarization, better paragraph breaks. ~17–30% of audio length.
- sonix: 4–7 minutes when their bulk speaker-fix is used. 13–23% of audio length. the editor saves real time but the subscription is the catch.
- otter: 6–10 minutes for non-meeting audio. optimized for meeting bots; doesn't help much on file uploads.
- open-source baselines: 7–12 minutes. the model is roughly the commercial mid-tier; the cleanup tax is high because there's no editor.
our target on the same files: under 90 seconds. under 5% of audio length. the difference is not the model — the model baseline is shared. the difference is the editor: bulk speaker fix, click-word-to-replay-audio for verification, per-account vocabulary that learns across files, structural paragraph-break detection that gets it right the first time.
why this matters more than the per-minute price
for an individual buyer, the break-even hourly rate against temi is $50: at a 30% cleanup ratio, 0.3 minutes of your time per audio minute costs $0.25 at $50/hour, exactly the per-minute rate. bill more than that and the cleanup tax exceeds the visible bill on every file. the per-minute price you negotiated and the subscription you debated are the visible costs; the time the file costs you afterwards is the larger one.
the math gets stark at volume. a researcher running 20 one-hour interviews across a study, on temi: $300 in invoice cost (20 hours × 60 minutes × $0.25). cleanup tax at 30%: 6 hours of researcher time, billed at $60/hour = $360. the invoice was 45% of the actual cost.
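the study-level numbers fall straight out of the same ratio. a sketch using the figures above (`study_cost` is a hypothetical helper):

```python
def study_cost(files, minutes_each, per_min_rate, cleanup_ratio, hourly_rate):
    """Total cost of a study: vendor invoice plus researcher time,
    and the invoice's share of the real total."""
    audio_min = files * minutes_each
    invoice = audio_min * per_min_rate
    time_cost = audio_min * cleanup_ratio / 60 * hourly_rate
    return invoice, time_cost, invoice / (invoice + time_cost)

# 20 one-hour interviews, $0.25/min, 30% cleanup ratio, $60/hour researcher
invoice, time_cost, share = study_cost(20, 60, 0.25, 0.30, 60)
print(invoice, time_cost, round(share, 2))  # 300.0 360.0 0.45
```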
this is why we benchmark on cleanup time, not word error rate (WER). it's why we built the editor before we built the marketing page. and it's why our pricing model is flat: same per-minute rate for everyone, all features, no tier upgrade gating bulk speaker-fix. we're competing on the larger bill, not the smaller one.
how to estimate your own cleanup tax
a quick calibration:
- run a 10-minute file through your current tool.
- time how long it takes you to get from the delivered transcript to one you'd actually publish. include speaker fixes, proper nouns, paragraph breaks, and quote verification.
- divide by 10. that's your cleanup-time-per-audio-minute ratio.
- multiply by your typical file length and your hourly rate.
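the steps above reduce to one small function (a sketch; plug in your own timings and rate):

```python
def cleanup_tax(cleanup_min, test_file_min, typical_file_min, hourly_rate):
    """Turn a timed test file into a per-file cleanup-tax estimate."""
    ratio = cleanup_min / test_file_min               # cleanup min per audio min
    tax = typical_file_min * ratio / 60 * hourly_rate # dollars per typical file
    return ratio, tax

# 3 minutes of cleanup on a 10-minute test, 60-minute typical files, $80/hour
ratio, tax = cleanup_tax(3, 10, 60, 80)
print(ratio, tax)  # 0.3 24.0
```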
most users we've talked to land between 0.2 and 0.4 (20–40% of audio length spent cleaning). if your number is in that range, your tool is normal — and the cleanup tax is the larger of your two costs.
if your number is over 0.5 you have a tool problem worth fixing. if it's under 0.05 you either have astonishingly clean audio or a tool we'd like to know about.