Word Error Rate, Precision

Comparative Performance


Vendor        Service                      Word Error Rate (%)  Transcription Precision (%)  Speaker Labels  Price ($/minute)  Automatic
TranscribeMe  Verbatim Transcription       4.3                  97.6                                         2.75
Mod9          ASR                          7.1                  95.6                                         0.01
Google        Cloud Speech-to-Text         12.3                 93.2                                         0.024
Amazon        AWS Transcribe               12.9                 93.3                                         0.024
VoiceBase     High Accuracy Transcription  15.3                 91.5                                         0.02
IBM           Watson Speech-to-Text        15.7                 90.7                                         0.02

● This “Switchboard” test set has been widely used for evaluating speech recognition research. As of August 2017, the best reported performance is 5.1% WER by Microsoft Research.
● Word Error Rate (WER) is a measure of “verbatim” transcription accuracy: it counts each word insertion, deletion, or substitution, including conversational “disfluencies” (e.g. repeated words), relative to the number of words in the reference transcript.
● Transcription Precision is the fraction of output words that are correct; it does not penalize missed words. This may be more intuitive than WER, and gives consistently high scores to human transcription.
● Speaker labels improve transcription quality and should be easily determined from dual-channel audio recordings, as formatted in the original files used for this evaluation. However, many systems only accept single-channel audio or will automatically downmix dual-channel audio; in these cases, the audio must be split into separate files and submitted as two requests, denoted as (duplex).
● Speaker diarization can be applied by Remeeting, and some of the other systems, to automatically identify speakers from audio that has been mixed down to a single channel, denoted as (mono).
● Punctuation and capitalization can be automatically added by Remeeting and some of the other systems. (This benchmark does not score punctuation or capitalization accuracy.)
● Help us to improve these references! We have discovered mistakes in the past and will share our corrections. Please report bugs to
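
The two metrics defined above can be illustrated with a minimal sketch. It assumes whitespace-tokenized, already-normalized transcripts and a standard minimum-edit-distance alignment in which ties are broken in favor of more correct words; the scoring tool and text normalization actually used for this benchmark are not described here, so the function names (align_counts, word_error_rate, transcription_precision) and the tie-breaking rule are illustrative assumptions only. With C correct words, S substitutions, D deletions, and I insertions, WER is (S + D + I) / (C + S + D) and precision is C / (C + S + I).

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int, int]:
    """Return (correct, substitutions, deletions, insertions).

    Uses a Levenshtein alignment with unit costs. Among alignments with the
    fewest errors, the one with the most correct words is kept (an assumption
    made here so that the result is deterministic).
    """
    # Each cell is (errors, correct, subs, dels, ins) for ref[:i] vs hyp[:j].
    prev = [(j, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [(i, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            e, c, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, c + 1, s, d, n)       # words match
            else:
                diag = (e + 1, c, s + 1, d, n)   # substitution
            e, c, s, d, n = prev[j]
            dele = (e + 1, c, s, d + 1, n)       # reference word missed
            e, c, s, d, n = curr[j - 1]
            ins = (e + 1, c, s, d, n + 1)        # extra output word
            curr.append(min(diag, dele, ins, key=lambda t: (t[0], -t[1])))
        prev = curr
    _, c, s, d, n = prev[-1]
    return c, s, d, n


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """(S + D + I) divided by the number of reference words (ref must be non-empty)."""
    _, s, d, n = align_counts(ref, hyp)
    return (s + d + n) / len(ref)


def transcription_precision(ref: List[str], hyp: List[str]) -> float:
    """Correct words divided by the number of output words (hyp must be non-empty)."""
    c, _, _, _ = align_counts(ref, hyp)
    return c / len(hyp)


# Hypothetical example: the output drops a repeated word and a filler,
# and gets one word wrong.
reference = "i i think it was uh really good".split()
hypothesis = "i think it was really great".split()

print(f"WER: {word_error_rate(reference, hypothesis):.3f}")                  # WER: 0.375
print(f"Precision: {transcription_precision(reference, hypothesis):.3f}")    # Precision: 0.833
```

In this example the two omitted disfluencies count as deletions and raise WER to 3/8, while precision is affected only by the one wrong output word (5 of 6 output words are correct).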

