Dutch Open Speech Recognition Benchmark

Results of Dutch ASR models, collected by the community

Detailed results for WhisperX

WhisperX is an implementation of Whisper with support for batching, word/character-level time alignment using wav2vec 2.0, and speaker diarization.

There are four components that we can define:

- Loading: loading Whisper plus the speaker diarization module.
- Transcriber: the part where Whisper runs to generate the text transcription, without any timestamps.
- Aligner: generates the word-level timestamps using wav2vec 2.0.
- Diarizer: identifies the speaker per segment and per word by assigning speaker IDs.
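
For reference, here is a minimal sketch of those four stages following the WhisperX README (API locations can differ slightly between WhisperX versions, and "HF_TOKEN" is a placeholder for a Hugging Face access token):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("example.wav")  # illustrative file name

# 1. Loading: the Whisper model (the diarization pipeline is loaded below).
model = whisperx.load_model("large-v2", device, compute_type="float16")

# 2. Transcriber: batched Whisper inference; segments have no word timestamps yet.
result = model.transcribe(audio, batch_size=64)

# 3. Aligner: word-level timestamps from a wav2vec 2.0 alignment model.
align_model, metadata = whisperx.load_align_model(language_code="nl", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 4. Diarizer: assign speaker IDs per segment and per word.
diarize_model = whisperx.DiarizationPipeline(use_auth_token="HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
```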

Because the wav2vec 2.0 based aligner does not support aligning digits and currency symbols, numbers and currencies have been converted to their written form by setting suppress_numerals=True. In addition, the original aligner used in WhisperX, based on XLSR-53, has been replaced with an aligner based on XLS-R, because the original aligner struggled with some characters that are less common in Dutch but still part of its orthography (mainly accented vowels). This may have increased the time spent on alignment compared to the XLSR-53 version; an ablation study is planned as future work to confirm this hypothesis.
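
Both adjustments fit the same API: suppress_numerals is passed through the ASR options, and load_align_model accepts a custom model_name. A sketch, where the XLS-R checkpoint name is an assumption (this page does not state which checkpoint was used):

```python
import whisperx

device = "cuda"

# suppress_numerals=True makes Whisper write out numbers and currencies as
# words, which the wav2vec 2.0 aligner can handle.
model = whisperx.load_model(
    "large-v2", device, compute_type="float16",
    asr_options={"suppress_numerals": True},
)

# Swap the default XLSR-53 based Dutch aligner for an XLS-R based one.
# NOTE: this checkpoint name is an illustrative assumption, not necessarily
# the one used for these results.
align_model, metadata = whisperx.load_align_model(
    language_code="nl", device=device,
    model_name="jonatasgrosman/wav2vec2-xls-r-1b-dutch",
)
```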


Two variables have been experimented with: the Whisper model version (large-v2 vs. large-v3) and the compute precision (float16 vs. float32).

Labelled data

The batch size used is 64 for float16 and 16 for float32.
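
As a self-contained sketch (file name and device are illustrative), this is how the precision and batch size combine in the transcribe call:

```python
import whisperx

compute_type = "float16"  # or "float32"
# Batch sizes as used in these runs: float16 needs less memory per sample,
# leaving room for a larger batch on the same GPU.
batch_size = 64 if compute_type == "float16" else 16

model = whisperx.load_model("large-v2", "cuda", compute_type=compute_type)
audio = whisperx.load_audio("example.wav")
result = model.transcribe(audio, batch_size=batch_size)
```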

Here’s a matrix with the time spent by each component of WhisperX on the labelled data, using the parameter configurations mentioned on the previous page:

| Configuration \ Component | Loading | Transcriber | Aligner | Diarizer | Total | Total + saving to JSON |
| --- | --- | --- | --- | --- | --- | --- |
| large-v2 with float16 | 3.34s | 4m:32s | 7m:00s | 2m:53s | 14m:28s | 14m:56s |
| large-v2 with float32 | 6.30s | 8m:10s | 6m:59s | 2m:55s | 18m:10s | 18m:32s |
| large-v3 with float16 | 3.16s | 4m:15s | 6m:42s | 2m:53s | 13m:53s | 14m:20s |
| large-v3 with float32 | 6.59s | 8m:03s | 6m:55s | 2m:54s | 17m:58s | 18m:21s |
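
These totals are wall-clock times per component. A minimal sketch of how such timings can be collected (the benchmark's actual instrumentation is not shown on this page; the timed helper is illustrative):

```python
import time

def timed(label, fn, *args, **kwargs):
    # Illustrative helper: measure the wall-clock time of one pipeline component.
    start = time.perf_counter()
    out = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s")
    return out, elapsed

# Example usage with the pipeline sketched earlier:
# result, t = timed("Transcriber", model.transcribe, audio, batch_size=64)
```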


And also a matrix with the maximum GPU memory consumption and maximum GPU power usage per component for each configuration (averaged over runs):

| Max. memory / Max. power | Transcriber | Aligner | Diarizer |
| --- | --- | --- | --- |
| large-v2 with float16 | 9947 MiB / 249 W | 11667 MiB / 233 W | 13011 MiB / 207 W |
| large-v2 with float32 | 13940 MiB / 252 W | 14364 MiB / 229 W | 15727 MiB / 206 W |
| large-v3 with float16 | 9944 MiB / 250 W | 11697 MiB / 249 W | 13011 MiB / 206 W |
| large-v3 with float32 | 14094 MiB / 254 W | 14590 MiB / 248 W | 15916 MiB / 207 W |
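
Peak memory and power figures like these can be obtained by polling NVML while a component runs. A sketch, assuming the nvidia-ml-py (pynvml) bindings; how these numbers were actually measured is not stated on this page:

```python
import threading
import time

import pynvml  # pip install nvidia-ml-py

def monitor_gpu(stop_event, stats, device_index=0, interval_s=0.5):
    # Poll used memory (MiB) and power draw (W), keeping the maxima seen.
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    while not stop_event.is_set():
        mem_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used // (1024 ** 2)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000  # mW -> W
        stats["max_mem_mib"] = max(stats.get("max_mem_mib", 0), mem_mib)
        stats["max_power_w"] = max(stats.get("max_power_w", 0), power_w)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()

stats, stop = {}, threading.Event()
thread = threading.Thread(target=monitor_gpu, args=(stop, stats))
thread.start()
# ... run one pipeline component here, e.g. model.transcribe(audio, batch_size=64) ...
stop.set()
thread.join()
print(stats)  # e.g. {'max_mem_mib': 9947, 'max_power_w': 249.0}
```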

Unlabelled data

The batch sizes used:

Here’s a matrix with the time spent by each component of WhisperX on the unlabelled data, using the parameter configurations mentioned on the previous page:

| Configuration \ Component | Loading | Transcriber | Aligner | Diarizer | Total | Total + saving to JSON |
| --- | --- | --- | --- | --- | --- | --- |
| large-v2 with float16 | 3.19s | 13m:45s | 11m:04s | 11m:30s | 36m:22s | 36m:28s |
| large-v2 with float32 | 8.00s | 20m:27s | 11m:26s | 11m:38s | 43m:39s | 43m:40s |
| large-v3 with float16 | 3.18s | 14m:28s | 11m:12s | 11m:38s | 37m:21s | 37m:26s |
| large-v3 with float32 | 6.38s | 20m:11s | 11m:07s | 11m:27s | 42m:51s | 42m:53s |


And also a matrix with the maximum GPU memory consumption and maximum GPU power usage per component for each configuration (averaged over runs):

| Max. memory / Max. power | Transcriber | Aligner | Diarizer |
| --- | --- | --- | --- |
| large-v2 with float16 | 21676 MiB / 268 W | 12541 MiB / 295 W | 14477 MiB / 277 W |
| large-v2 with float32 | 21657 MiB / 279 W | 15355 MiB / 300 W | 17291 MiB / 276 W |
| large-v3 with float16 | 22425 MiB / 267 W | 12542 MiB / 294 W | 14477 MiB / 276 W |
| large-v3 with float32 | 21580 MiB / 279 W | 15355 MiB / 298 W | 17291 MiB / 276 W |