Dutch Open Speech Recognition Benchmark

Results of Dutch ASR models, collected by the community

Results on the labelled audio of Broadcast News in the Netherlands

This data does not reflect the type of content to which the ASR will typically be applied in terms of audio length, but it offers a rough estimate of WER performance under difficult speech conditions, particularly with respect to the time alignment of the word-level timestamps against the reference files.


For each Whisper implementation, two variables have been varied: the model version (large-v2 or large-v3) and the compute type (float16 or float32).

The compute type refers to the data type used to represent real numbers, such as the weights of the Whisper model. In our case, float16, also known as half precision, uses 16 bits to store a single floating-point number, whereas float32, known as single precision, uses 32 bits. Across deep learning applications, float16 is known to use less memory and run faster, at the cost of some accuracy. In the case of Whisper, however, it has been reported that float16 leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
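
In most implementations, the compute type is a single parameter set when loading the model. As a minimal sketch, here is how it is selected in faster-whisper (the model size, device, and audio path are illustrative choices, not the benchmark's exact configuration):

```python
from faster_whisper import WhisperModel

# compute_type="float16" halves the memory footprint of the weights;
# use "float32" for full single precision.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# word_timestamps=True requests the word-level timestamps evaluated here
segments, info = model.transcribe("audio.wav", language="nl", word_timestamps=True)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```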


Here is a matrix with the WER results of the baseline implementation from OpenAI, as well as of several more optimized implementations:

| Model \ Parameters | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 11.1% | 11.0% | 12.9% | 13.2% |
| Huggingface (transformers) | 17.1% | 16.9% | 16.6% | 16.6% |
| faster-whisper | 10.3% | 10.3% | 11.8% | 11.8% |
| faster-whisper w/ batching | 10.3% | 10.2% | 12.4% | 12.4% |
| WhisperX | 12.3% | 12.4% | 13.0% | 12.9% |

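The benchmark's exact scoring pipeline (text normalization, punctuation handling) is not detailed here, but the underlying metric is the standard word error rate. A minimal sketch using the jiwer package, assuming no extra normalization:

```python
import jiwer

reference = "de kat zat op de mat"
hypothesis = "de kat zat op mat"

# WER = (substitutions + deletions + insertions) / number of reference words
print(jiwer.wer(reference, hypothesis))  # one deletion over six words ~= 0.1667
```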

And a matrix with the total time each implementation spent loading the model and transcribing the dataset:

| Load + transcribe | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 36m:06s | 32m:41s | 42m:08s | 30m:25s |
| Huggingface (transformers) | 21m:48s | 19m:13s | 23m:22s | 22m:02s |
| faster-whisper | 11m:39s | 22m:29s | 11m:04s | 24m:04s |
| faster-whisper w/ batching | 4m:45s | 8m:52s | 4m:25s | 8m:28s |
| WhisperX* | 11m:34s | 15m:15s | 11m:01s | 15m:04s |

* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied to obtain word-level timestamps. The measured time therefore includes loading the models, transcribing, and aligning to generate the timestamps. Speaker diarization has also been applied for WhisperX; it is measured separately and covered in this section.
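
As a rough sketch of that two-step transcribe-then-align flow (the model size, batch size, and audio path are illustrative, and the API shown is that of recent whisperx releases):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")

# Step 1: transcribe with the batched Whisper model
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Step 2: align the segments with a wav2vec 2.0 model to get word-level timestamps
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
```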


Here’s also a matrix with the Real-Time Factor, or RTF for short (defined as the time to process all of the input divided by the duration of that input), for transcribing 2.23 hours of speech (rounded to 4 decimals):

| RTF (process time / audio duration) | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 0.2698 | 0.2443 | 0.3149 | 0.2273 |
| Huggingface (transformers) | 0.1629 | 0.1436 | 0.1746 | 0.1647 |
| faster-whisper | 0.0871 | 0.1680 | 0.0827 | 0.1799 |
| faster-whisper w/ batching | 0.0355 | 0.0663 | 0.0330 | 0.0633 |
| WhisperX* | 0.0864 | 0.1140 | 0.0823 | 0.1126 |

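As a sanity check, the RTF values follow directly from the timing table above. For example, for faster-whisper with batching (large-v2, float16):

```python
# RTF = processing time / duration of the audio
audio_duration = 2.23 * 3600   # 2.23 hours of speech, in seconds (8028 s)
process_time = 4 * 60 + 45     # 4m:45s load+transcribe time, in seconds (285 s)

rtf = process_time / audio_duration
print(round(rtf, 4))           # 0.0355, matching the entry in the matrix
```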

Finally, a matrix with the maximum GPU memory consumption and the maximum GPU power usage of each implementation (on average):

| Max. memory / Max. power | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
| --- | --- | --- | --- | --- |
| OpenAI | 10621 MiB / 240 W | 10639 MiB / 264 W | 10927 MiB / 238 W | 10941 MiB / 266 W |
| Huggingface (transformers)* | 15073 MiB / 141 W | 12981 MiB / 215 W | 14566 MiB / 123 W | 19385 MiB / 235 W |
| faster-whisper | 4287 MiB / 230 W | 7776 MiB / 263 W | 4292 MiB / 230 W | 7768 MiB / 262 W |
| faster-whisper w/ batching* | 5616 MiB / 243 W | 9893 MiB / 264 W | 5601 MiB / 242 W | 9877 MiB / 264 W |
| WhisperX* | 9947 MiB / 249 W | 13940 MiB / 252 W | 9944 MiB / 250 W | 14094 MiB / 254 W |

* For these implementations, batching is supported. Setting a higher batch_size leads to faster inference at the cost of extra memory usage.
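
For faster-whisper, batching is exposed through a wrapper around the regular model; a minimal sketch, assuming a recent faster-whisper release that ships BatchedInferencePipeline (the batch size shown is illustrative):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

# Larger batch_size: faster inference, higher peak GPU memory
segments, info = batched_model.transcribe("audio.wav", batch_size=16)
```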