Performance and Benchmarks
In head-to-head comparisons, Mod9 outperforms some of the industry's most widely used ASR solutions on both accuracy and performance metrics.
Word Error Rate and Precision

| Vendor | Service | Word Error Rate (%) | Transcription Precision (%) | Speaker Labels | Price ($/minute) | Automatic |
|---|---|---|---|---|---|---|
| VoiceBase | High Accuracy Transcription | 15.3 | 91.5 | ✅ | 0.02 | ✅ |
● This “Switchboard” test set has been widely used for evaluating speech recognition research. As of August 2017, the best reported performance is 5.1% WER by Microsoft Research.
● Word Error Rate is a measure of “verbatim” transcription accuracy that counts each word insertion, deletion, or substitution — including conversational “disfluencies” (e.g. repeated words).
● Transcription Precision is the fraction of output words that are correct, without penalizing missed words. This may be more intuitive than WER, and gives consistently high scores to human transcription.
● Speaker labels improve transcription quality and are trivially determined from dual-channel audio recordings, the format of the original files used in this evaluation. However, many systems accept only single-channel audio, or will automatically downmix dual-channel audio; in these cases, each channel must be split into a separate file and submitted as two requests, denoted as (duplex).
● Speaker diarization can be applied by Remeeting, and some of the other systems, to automatically identify speakers from audio that has been mixed down to a single channel, denoted as (mono).
● Punctuation and capitalization can be automatically added by Remeeting and some of the other systems. (This benchmark does not score punctuation or capitalization accuracy.)
● Help us improve these benchmarks! We have discovered mistakes in the past and will share our corrections. Please report bugs to firstname.lastname@example.org
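The two metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code used for this benchmark: WER is computed as word-level edit distance over reference length, and the precision-style score here approximates "fraction of output words that are correct" using `difflib` matching blocks, which may differ slightly from a full alignment-based scorer.

```python
import difflib

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

def transcription_precision(reference, hypothesis):
    """Fraction of hypothesis words matched to the reference (missed words not penalized)."""
    ref, hyp = reference.split(), hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref, b=hyp, autojunk=False)
    matches = sum(block.size for block in matcher.get_matching_blocks())
    return matches / len(hyp)
```

For example, a hypothesis that drops one reference word has a nonzero WER but perfect precision, which illustrates why precision tends to score human transcripts highly even when they omit disfluencies.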
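For systems that accept only single-channel audio, the channel-splitting step described above can be done with the standard library alone. The sketch below assumes 16-bit PCM stereo WAV input, with two interleaved samples per frame; the function and file names are hypothetical, not part of any vendor's API.

```python
import wave

def split_stereo_wav(src_path, left_path, right_path):
    """Split a 16-bit PCM stereo WAV into two mono WAV files, one per channel."""
    with wave.open(src_path, "rb") as src:
        if src.getnchannels() != 2 or src.getsampwidth() != 2:
            raise ValueError("expected 16-bit stereo PCM input")
        framerate = src.getframerate()
        frames = src.readframes(src.getnframes())

    left, right = bytearray(), bytearray()
    # Frames are interleaved as [L-lo L-hi R-lo R-hi], 4 bytes per stereo frame.
    for i in range(0, len(frames), 4):
        left += frames[i:i + 2]
        right += frames[i + 2:i + 4]

    for out_path, data in ((left_path, left), (right_path, right)):
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(framerate)
            dst.writeframes(bytes(data))
```

Each resulting mono file can then be submitted as its own request, matching the (duplex) condition in the table.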