Dutch Open Speech Recognition Benchmark

Results of Dutch ASR models, collected by the community

Detailed results for WhisperX

WhisperX is an implementation of Whisper with support for batching, word/character-level time alignment using wav2vec 2.0, and speaker diarization.

The pipeline consists of four components: the loader (which loads Whisper and the speaker diarization module), the transcriber (where Whisper generates the text transcription, without any timestamps), the aligner (which generates word-level timestamps using wav2vec 2.0), and the diarizer (which identifies the speaker per segment and per word by assigning speaker IDs).
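
As a rough illustration, the four stages map onto the WhisperX Python API along the lines of the project's README; a minimal sketch (the file path, batch size, and Hugging Face token are placeholders):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.wav")  # placeholder path

# Loading: the Whisper model (the diarization module is loaded further below)
model = whisperx.load_model("large-v2", device, compute_type="float16")

# Transcriber: batched Whisper inference, no timestamps yet
result = model.transcribe(audio, batch_size=64)

# Aligner: word-level timestamps via a wav2vec 2.0 CTC model
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device)

# Diarizer: assign speaker IDs per segment and per word
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```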

Because the wav2vec 2.0-based aligner does not support aligning digits and currency symbols, numbers and currencies have been converted to their written form by setting suppress_numerals=True. In addition, the original aligner used in WhisperX, based on XLSR-53, has been replaced with an aligner based on XLS-R, because the original aligner struggled with some characters that are less common in Dutch but are part of its orthography (mainly accented vowels). This may have increased the time spent on alignment compared to the XLSR-53 version; an ablation study is planned as future work to confirm this hypothesis.
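In code, both adjustments could look roughly like this; note that the page does not name the exact XLS-R checkpoint, so the model_name below is an assumed placeholder for a Dutch XLS-R CTC model:

```python
import whisperx

device = "cuda"

# Numbers and currencies come out in written form, which the aligner can handle.
model = whisperx.load_model(
    "large-v2",
    device,
    compute_type="float16",
    asr_options={"suppress_numerals": True},
)

# Override WhisperX's default Dutch XLSR-53 aligner with an XLS-R based one.
model_a, metadata = whisperx.load_align_model(
    language_code="nl",
    device=device,
    model_name="jonatasgrosman/wav2vec2-xls-r-1b-dutch",  # assumed checkpoint
)
```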


Two variables have been experimented with: the Whisper model version (large-v2 vs. large-v3) and the compute type (float16 vs. float32).

Labelled data

The batch size used is 64 for float16 and 16 for float32.
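
For illustration, here is one way the four configurations and the precision-dependent batch sizes could be enumerated; this is a sketch under those assumptions (the path is a placeholder), not the benchmark's actual harness:

```python
import itertools
import whisperx

device = "cuda"
audio = whisperx.load_audio("labelled_sample.wav")  # placeholder path
batch_sizes = {"float16": 64, "float32": 16}

# Sweep both experimental variables: model version and compute type.
for model_name, compute_type in itertools.product(
    ("large-v2", "large-v3"), ("float16", "float32")
):
    model = whisperx.load_model(model_name, device, compute_type=compute_type)
    result = model.transcribe(audio, batch_size=batch_sizes[compute_type])
    print(model_name, compute_type, len(result["segments"]))
```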

Here’s a matrix with the time spent by each component of WhisperX on the labelled data, using the various parameter configurations mentioned on the previous page (a sketch of how such per-component timings can be collected follows the table):

| Configuration \ Component | Loading | Transcriber | Aligner | Diarizer | Total | Total + saving to JSON |
| --- | --- | --- | --- | --- | --- | --- |
| large-v2 with float16 | 79.98s | 7m 16s | 22m 32s | 4m 22s | 35m 30s | 36m 14s |
| large-v2 with float32 | 82.74s | 12m 31s | 19m 28s | 4m 14s | 37m 36s | 38m 10s |
| large-v3 with float16 | 78.02s | 7m 07s | 20m 03s | 4m 17s | 32m 45s | 33m 26s |
| large-v3 with float32 | 53.35s | 12m 35s | 20m 13s | 4m 16s | 37m 57s | 39m 04s |
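
Per-component timings like the ones above can be collected by wrapping each stage in a simple context manager; a minimal sketch (the stage names are illustrative, and this is not necessarily how the benchmark instruments its runs):

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(stage):
    # Accumulate wall-clock seconds per pipeline stage.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + time.perf_counter() - start

# Wrap each stage, e.g.:
# with timed("transcriber"):
#     result = model.transcribe(audio, batch_size=64)
# with timed("aligner"):
#     result = whisperx.align(result["segments"], model_a, metadata, audio, device)
```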


And a matrix with the maximum GPU memory consumption and maximum GPU power usage of each configuration (on average); a sketch of one way to sample these figures follows the table:

| Max. memory / max. power | Transcriber | Aligner | Diarizer |
| --- | --- | --- | --- |
| large-v2 with float16 | 9401 MiB / 181 W | 11282 MiB / 171 W | 12091 MiB / 122 W |
| large-v2 with float32 | 12714 MiB / 197 W | 13937 MiB / 164 W | 14925 MiB / 140 W |
| large-v3 with float16 | 9402 MiB / 186 W | 11114 MiB / 178 W | 12096 MiB / 138 W |
| large-v3 with float32 | 12721 MiB / 198 W | 14009 MiB / 166 W | 15916 MiB / 136 W |
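
Peak memory and power figures like these can be sampled in the background with NVML (the library behind nvidia-smi); a minimal sketch, assuming the nvidia-ml-py (pynvml) bindings and a single GPU at index 0:

```python
import threading
import time
import pynvml

def monitor_gpu(stop, peaks, interval=0.5):
    # Poll NVML for used memory (MiB) and power draw (W), keeping the maxima.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    while not stop.is_set():
        mem_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
        peaks["mem_mib"] = max(peaks.get("mem_mib", 0.0), mem_mib)
        peaks["power_w"] = max(peaks.get("power_w", 0.0), power_w)
        time.sleep(interval)
    pynvml.nvmlShutdown()

peaks, stop = {}, threading.Event()
thread = threading.Thread(target=monitor_gpu, args=(stop, peaks))
thread.start()
# ... run one pipeline stage here ...
stop.set()
thread.join()
```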

Unlabelled data

The batch sizes used:

Here’s a matrix with the time spent by each component of WhisperX on the unlabelled data, using the various parameter configurations mentioned on the previous page:

| Configuration \ Component | Loading | Transcriber | Aligner | Diarizer | Total | Total + saving to JSON |
| --- | --- | --- | --- | --- | --- | --- |
| large-v2 with float16 | 77.50s | 8m 58s | 11m 14s | 7m 34s | 29m 03s | 29m 59s |
| large-v2 with float32 | 57.55s | 12m 28s | 8m 49s | 6m 53s | 29m 07s | 29m 22s |
| large-v3 with float16 | 62.37s | 9m 00s | 10m 26s | 7m 29s | 27m 57s | 28m 09s |
| large-v3 with float32 | 7.47s | 12m 15s | 9m 14s | 6m 55s | 28m 31s | 28m 34s |


And a matrix with the maximum GPU memory consumption and maximum GPU power usage of each configuration (on average):

| Max. memory / max. power | Transcriber | Aligner | Diarizer |
| --- | --- | --- | --- |
| large-v2 with float16 | 19053 MiB / 257 W | 12354 MiB / 289 W | 14289 MiB / 275 W |
| large-v2 with float32 | 22013 MiB / 276 W | 15200 MiB / 288 W | 17135 MiB / 275 W |
| large-v3 with float16 | 19096 MiB / 256 W | 12514 MiB / 290 W | 14289 MiB / 275 W |
| large-v3 with float32 | 22042 MiB / 276 W | 15200 MiB / 294 W | 17135 MiB / 274 W |