View on GitHub

Dutch Open Speech Recognition Benchmark

Results of Dutch ASR models, collected by the community

Go back

Jasmin Dutch results

Here is a matrix with WER results of the baseline model, Kaldi_NL, as well as different models tested on Dutch read speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 28.1% 16.2% 43.6% 45.3% 20.9%
Whisper v2 22.6% 18.0% 36.5% 37.3% 22.2%
Whisper v3 34.2% 29.4% 50.4% 58.5% 34.4%
Whisper v2 w/ VAD 20.1% 12.4% 30.2% 33.4% 14.9%
Whisper v3 w/ VAD 34.7% 27.5% 46.7% 53.0% 30.2%
faster-whisper v2 20.3% 11.3% 29.9% 30.6% 13.7%
faster-whisper v3 28.1% 25.2% 50.9% 62.6% 27.6%
faster-whisper v2 w/ VAD 19.1% 11.1% 29.5% 30.0% 12.8%
faster-whisper v3 w/ VAD 27.5% 22.4% 42.6% 49.4% 25.2%
XLS-R FT on Dutch 22.4% 13.3% 33.8% 36.1% 17.2%
MMS - 102 languages 31.6% 20.3% 54.2% 55.1% 23.9%
MMS - 1162 languages 28.9% 20.0% 50.1% 54.0% 28.3%


And for Dutch conversational speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 55.4% 62.4% 69.1% 60.0% 44.0%
Whisper v2 95.8% 107.4% 124.0% 88.1% 61.9%
Whisper v3 75.7% 72.6% 94.3% 84.2% 58.4%
Whisper v2 w/ VAD 32.6% 29.4% 42.6% 54.0% 33.1%
Whisper v3 w/ VAD 40.3% 31.7% 57.1% 63.2% 41.3%
faster-whisper v2 58.9% 65.8% 107.4% 77.7% 39.9%
faster-whisper v3 85.8% 68.3% 84.4% 84.5% 51.4%
faster-whisper v2 w/ VAD 28.2% 22.9% 39.2% 51.4% 26.8%
faster-whisper v3 w/ VAD 34.4% 28.6% 48.7% 58.2% 33.6%
XLS-R FT on Dutch 60.2% 62.2% 70.5% 59.1% 47.0%
MMS - 102 languages 79.8% 79.9% 90.7% 80.5% 56.4%
MMS - 1162 languages 82.4% 87.9% 94.5% 83.3% 59.9%


And its corresponding matrix with the time spent in total by each model to evaluate Dutch read speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 0h:30m:21s 0h:23m:25s 0h:27m:51s 0h:27m:17s 0h:29m:36s
Whisper v2 2h:05m:29s 1h:53m:11s 1h:35m:28s 1h:24m:41s 2h:04m:35s
Whisper v3 3h:12m:26s 2h:27m:52s 6h:13m:28s* 3h:04m:32s 3h:09m:49s
Whisper v2 w/ VAD 2h:14m:40s 1h:51m:46s 1h:49m:48s 4h:18m:51s* 2h:08m:02s
Whisper v3 w/ VAD 2h:58m:24s 2h:19m:43s 2h:38m:23s 2h:31m:35s 2h:47m:33s
faster-whisper v2 0h:30m:45s 0h:26m:48s 0h:23m:48s 0h:21m:55s 0h:30m:02s
faster-whisper v3 0h:41m:58s 0h:38m:13s 0h:48m:28s 0h:55m:48s 0h:44m:12s
faster-whisper v2 w/ VAD 0h:32m:55s 0h:27m:16s 0h:25m:51s 0h:21m:58s 0h:32m:09s
faster-whisper v3 w/ VAD 0h:40m:33s 0h:31m:45s 0h:37m:36s 0h:37m:11s 0h:38m:00s
XLS-R FT on Dutch 0h:35m:18s 0h:27m:33s 0h:32m:39s 0h:31m:49s 0h:39m:05s
MMS - 102 languages 0h:17m:59s 0h:13m:22s 0h:16m:01s 0h:15m:38s 0h:17m:35s
MMS - 1162 languages 0h:17m:46s 0h:13m:22s 0h:16m:00s 0h:15m:37s 0h:17m:35s


And for Dutch conversational speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 0h:16m:09s 0h:10m:29s 0h:11m:28s 0h:19m:09s 0h:21m:32s
Whisper v2 1h:23m:33s 1h:12m:52s 1h:13m:23s 1h:31m:20s 2h:39m:12s
Whisper v3 2h:30m:51s 1h:59m:19s 2h:13m:38s 3h:07m:55s 4h:39m:06s*
Whisper v2 w/ VAD 0h:48m:52s 0h:42m:17s 0h:37m:07s 1h:02m:36s 1h:47m:58s
Whisper v3 w/ VAD 1h:05m:32s 0h:37m:53s 0h:55m:46s 1h:38m:03s 2h:16m:09s
faster-whisper v2 0h:22m:10s 0h:17m:19s 0h:20m:16s 0h:23m:23s 0h:34m:06s
faster-whisper v3 0h:54m:15s 0h:32m:13s 0h:34m:35s 0h:55m:02s 1h:12m:22s
faster-whisper v2 w/ VAD 0h:09m:59s 0h:07m:37s 0h:07m:57s 0h:13m:32s 0h:22m:31s
faster-whisper v3 w/ VAD 0h:13m:43s 0h:07m:17s 0h:09m:57s 0h:22m:45s 0h:25m:52s
XLS-R FT on Dutch 0h:42m:20s 0h:24m:19s 0h:26m:52s 0h:36m:42s 0h:48m:26s
MMS - 102 languages 0h:18m:02s 0h:14m:02s 0h:14m:01s 0h:18m:59s 0h:25m:34s
MMS - 1162 languages 0h:17m:55s 0h:13m:56s 0h:13m:59s 0h:18m:54s 0h:25m:24s

* Performance might have been impacted by other processes from other users running on the same GPU since the hardware is available via a cluster system. Future work includes rerunning these specific experiments.

Jasmin Flemish results

Matrix with WER results for Flemish read speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 59.2% 33.5% 51.3% 43.3% 24.7%
faster-whisper v2 42.4% 11.7% 19.9% 21.0% 16.7%
faster-whisper v3 57.2% 30.6% 44.4% 41.1% 38.7%
faster-whisper v2 w/ VAD 41.8% 11.6% 19.4% 20.5% 14.4%
faster-whisper v3 w/ VAD 56.2% 26.7% 38.4% 50.7% 33.6%
XLS-R FT on Dutch 47.4% 13.3% 30.1% 26.8% 16.4%
MMS - 102 languages 55.3% 22.4% 43.0% 37.0% 23.0%
MMS - 1162 languages 49.2% 21.8% 34.9% 35.8% 22.3%


And for Flemish conversational speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 66.5% 49.8% 66.2% 64.4% 47.4%
faster-whisper v2 87.6% 51.7% 76.1% 67.3% 45.4%
faster-whisper v3 90.5% 65.2% 100.4% 79.9% 68.3%
faster-whisper v2 w/ VAD 28.7% 24.3% 38.5% 49.3% 30.6%
faster-whisper v3 w/ VAD 46.0% 37.7% 57.9% 57.9% 44.6%
XLS-R FT on Dutch 73.2% 62.2% 68.1% 52.2% 47.8%
MMS - 102 languages 86.7% 52.3% 87.8% 78.2% 56.4%
MMS - 1162 languages 86.1% 68.0% 86.3% 76.7% 60.8%


And its corresponding matrix with the time spent in total by each model to evaluate Flemish read speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 0h:15m:58s 0h:16m:03s 0h:25m:11s 0h:15m:46s 0h:29m:36s
faster-whisper v2 0h:09m:30s 0h:20m:12s 0h:18m:03s 0h:12m:09s 0h:14m:31s
faster-whisper v3 0h:14m:53s 0h:24m:33s 0h:29m:19s 0h:21m:47s 0h:23m:58s
faster-whisper v2 w/ VAD 0h:21m:27s 0h:27m:16s 0h:19m:09s 0h:13m:24s 0h:15m:29s
faster-whisper v3 w/ VAD 0h:13m:17s 0h:23m:29s 0h:23m:14s 0h:26m:40s 0h:19m:54s
XLS-R FT on Dutch 0h:11m:18s 0h:20m:03s 0h:22m:05s 0h:16m:28s 0h:13m:00s
MMS - 102 languages 0h:05m:47s 0h:09m:09s 0h:10m:06s 0h:07m:37s 0h:08m:04s
MMS - 1162 languages 0h:05m:47s 0h:09m:07s 0h:10m:06s 0h:07m:37s 0h:08m:04s


And for Flemish conversational speech:

Model\Dataset Native Children Native Teenagers Non-native Minors Non-native Adults Native Elderly
Kaldi_NL 0h:07m:09s 0h:07m:36s 0h:08m:37s 0h:10m:51s 0h:14m:45s
faster-whisper v2 0h:12m:48s 0h:10m:45s 0h:14m:34s 0h:11m:58s 0h:34m:06s
faster-whisper v3 0h:24m:08s 0h:26m:42s 0h:28m:56s 0h:27m:12s 0h:31m:16s
faster-whisper v2 w/ VAD 0h:05m:41s 0h:07m:03s 0h:07m:01s 0h:08m:11s 0h:09m:45s
faster-whisper v3 w/ VAD 0h:06m:44s 0h:08m:23s 0h:10m:08s 0h:13m:52s 0h:10m:40s
XLS-R FT on Dutch 0h:20m:36s 0h:16m:58s 0h:20m:55s 0h:17m:47s 0h:19m:34s
MMS - 102 languages 0h:10m:55s 0h:09m:10s 0h:11m:00s 0h:09m:43s 0h:10m:33s
MMS - 1162 languages 0h:10m:06s 0h:09m:09s 0h:10m:42s 0h:09m:36s 0h:10m:15s