Dutch Open Speech Recognition Benchmark

Results of Dutch ASR models, collected by the community

Results on the labelled audio of Broadcast News in the Netherlands

This data does not reflect the type of content to which the ASR will typically be applied in terms of audio length, but it offers a rough estimate of WER performance under difficult speech conditions, particularly with respect to the time alignment of the word-level timestamps against the reference files.


For each Whisper implementation, two variables have been varied: the model version (large-v2 or large-v3) and the compute type (float16 or float32).

The compute type refers to the data type used to represent real numbers, such as the weights of the Whisper model. In our case, float16, also known as half precision, uses 16 bits to store a single floating-point number, whereas float32, known as single precision, uses 32 bits. Across deep learning applications, float16 is known to use less memory and run faster, at the cost of some accuracy. In the case of Whisper, however, it has been reported that float16 leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
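
In most implementations, the compute type is a single parameter set when loading the model. As a minimal sketch, here is how it is selected in faster-whisper (the model size, device, and audio path are illustrative choices, not the benchmark's exact configuration):

```python
from faster_whisper import WhisperModel

# compute_type="float16" halves the memory footprint of the weights;
# use "float32" for full single precision.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True requests the word-level timestamps evaluated here
segments, info = model.transcribe("audio.wav", language="nl", word_timestamps=True)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```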


Here is a matrix with the WER results of the baseline implementation from OpenAI, as well as of several more optimized implementations:

| Model \ Parameters | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 11.1% | 11.0% | 12.9% | 13.2% |
| Huggingface (transformers) | 17.1% | 16.9% | 16.6% | 16.6% |
| faster-whisper | 10.3% | 10.3% | 11.8% | 11.8% |
| faster-whisper w/ batching | 10.3% | 10.2% | 12.4% | 12.4% |
| WhisperX | 12.3% | 12.4% | 13.0% | 12.9% |

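The benchmark's exact scoring pipeline (text normalization, punctuation handling) is not detailed here, but the underlying metric is the standard word error rate. A minimal sketch using the jiwer package, assuming no extra normalization:

```python
import jiwer

reference = "de kat zat op de mat"
hypothesis = "de kat zat op mat"

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))  # one deletion over six words ~= 0.1667
```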

And a matrix with the total time each implementation spent loading the model and transcribing the dataset:

| Load + transcribe | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 36m:06s | 32m:41s | 42m:08s | 30m:25s |
| Huggingface (transformers) | 21m:48s | 19m:13s | 23m:22s | 22m:02s |
| faster-whisper | 11m:39s | 22m:29s | 11m:04s | 24m:04s |
| faster-whisper w/ batching | 4m:45s | 8m:52s | 4m:25s | 8m:28s |
| WhisperX* | 11m:34s | 15m:15s | 11m:01s | 15m:04s |

* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied to obtain word-level timestamps. The measured time therefore includes loading the models, transcribing, and aligning to generate the timestamps. Speaker diarization has also been applied for WhisperX; it is measured separately and covered in this section.
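
As a rough sketch of that two-step transcribe-then-align flow (the model size, batch size, and audio path are illustrative, and the API shown is that of recent whisperx releases):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Step 1: transcribe with the batched Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Step 2: align the segments with a wav2vec 2.0 model to get word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
```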


Here’s also a matrix with the Real-Time Factor, or RTF for short (defined as the time to process all of the input divided by the duration of that input), for transcribing 2.23 hours of speech (rounded to 4 decimals):

| RTF (process time / audio duration) | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 0.2698 | 0.2443 | 0.3149 | 0.2273 |
| Huggingface (transformers) | 0.1629 | 0.1436 | 0.1746 | 0.1647 |
| faster-whisper | 0.0871 | 0.1680 | 0.0827 | 0.1799 |
| faster-whisper w/ batching | 0.0355 | 0.0663 | 0.0330 | 0.0633 |
| WhisperX* | 0.0864 | 0.1140 | 0.0823 | 0.1126 |

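As a sanity check, the RTF values follow directly from the timing table above. For example, for faster-whisper with batching (large-v2, float16):

```python
# RTF = processing time / duration of the audio
audio_duration = 2.23 * 3600   # 2.23 hours of speech, in seconds (8028 s)
process_time = 4 * 60 + 45     # 4m:45s load+transcribe time, in seconds (285 s)

rtf = process_time / audio_duration
print(round(rtf, 4))           # 0.0355, matching the entry in the matrix
```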

Finally, a matrix with the maximum GPU memory consumption and the maximum GPU power usage of each implementation (on average):

| Max. memory / Max. power | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 10621 MiB / 240 W | 10639 MiB / 264 W | 10927 MiB / 238 W | 10941 MiB / 266 W |
| Huggingface (transformers)* | 15073 MiB / 141 W | 12981 MiB / 215 W | 14566 MiB / 123 W | 19385 MiB / 235 W |
| faster-whisper | 4287 MiB / 230 W | 7776 MiB / 263 W | 4292 MiB / 230 W | 7768 MiB / 262 W |
| faster-whisper w/ batching* | 5616 MiB / 243 W | 9893 MiB / 264 W | 5601 MiB / 242 W | 9877 MiB / 264 W |
| WhisperX* | 9947 MiB / 249 W | 13940 MiB / 252 W | 9944 MiB / 250 W | 14094 MiB / 254 W |

* For these implementations, batching is supported. Setting a higher batch_size leads to faster inference at the cost of extra memory usage.
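
For faster-whisper, batching is exposed through a wrapper around the regular model; a minimal sketch, assuming a recent faster-whisper release that ships BatchedInferencePipeline (the batch size shown is illustrative):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# Larger batch_size: faster inference, higher peak GPU memory
segments, info = batched_model.transcribe("audio.wav", batch_size=16)
```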