Results on the labelled audio of Broadcast News in the Netherlands
This data does not reflect the type of content the ASR will eventually be applied to in terms of audio length, but it offers a rough estimate of the model's WER under difficult speech conditions, particularly with respect to how well the word-level timestamps align in time with the reference files.
For each Whisper implementation, the following variables have been modified:

- The model version: `large-v2` vs. `large-v3` (to confirm the hypothesis from the UT evaluation)
- The compute type: `float16` vs. `float32`
- For Huggingface, WhisperX, and faster-whisper with batching, the `batch_size`: 2 for HF, 64 for WhisperX `float16`, 16 for WhisperX `float32`, and 64 for faster-whisper with batching
The compute type refers to the data type used to represent real numbers, such as the weights of the Whisper model. In our case, `float16`, also known as half-precision, uses 16 bits to store a single floating-point number, whereas `float32`, known as single-precision, uses 32 bits. Across deep learning applications, `float16` is known to use less memory and run faster, at the cost of some accuracy. In the case of Whisper, however, it has been reported that `float16` leads to only a 0.1% increase in WER while significantly reducing the time and memory required to run the model.
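To show where this setting lives in practice, here is a minimal sketch using faster-whisper (one of the implementations evaluated below); the audio path is a placeholder and the options shown are illustrative rather than the exact evaluation setup:

```python
from faster_whisper import WhisperModel

# Load Whisper large-v2 with half-precision weights; switching
# compute_type to "float32" roughly doubles the memory footprint.
model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# "broadcast_news.wav" is a placeholder path, not part of the evaluated dataset.
segments, info = model.transcribe("broadcast_news.wav", word_timestamps=True)
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```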
Here is a matrix with WER results of the baseline implementation from OpenAI, as well as different, more optimized implementations:
| Model \ Parameters | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
|---|---|---|---|---|
| OpenAI | 11.1% | 11.0% | 12.9% | 13.2% |
| Huggingface (`transformers`) | 17.1% | 16.9% | 16.6% | 16.6% |
| faster-whisper | 10.3% | 10.3% | 11.8% | 11.8% |
| faster-whisper w/ batching | 10.3% | 10.2% | 12.4% | 12.4% |
| WhisperX | 12.3% | 12.4% | 13.0% | 12.9% |
And a matrix with the total time spent by each implementation to load and transcribe the dataset:
| Load + transcribe | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
|---|---|---|---|---|
| OpenAI | 36m:06s | 32m:41s | 42m:08s | 30m:25s |
| Huggingface (`transformers`) | 21m:48s | 19m:13s | 23m:22s | 22m:02s |
| faster-whisper | 11m:39s | 22m:29s | 11m:04s | 24m:04s |
| faster-whisper w/ batching | 4m:45s | 8m:52s | 4m:25s | 8m:28s |
| WhisperX* | 11m:34s | 15m:15s | 11m:01s | 15m:04s |
* For WhisperX, a separate alignment model based on wav2vec 2.0 has been applied to obtain word-level timestamps. The measured time therefore includes loading the model, transcribing, and aligning to generate the timestamps. Speaker diarization has also been applied for WhisperX; it is measured separately and covered in this section.
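As a rough illustration of this two-step process, here is a sketch of a WhisperX transcribe-then-align call, modelled on the WhisperX README; exact argument names may differ between versions, and the audio path is a placeholder:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("broadcast_news.wav")  # placeholder path

# Step 1: transcribe with the batched Whisper backend.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=64)

# Step 2: align the segments with a wav2vec 2.0 model to get word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
```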
Here is also a matrix with the Real-Time Factor (RTF for short), defined as the time to process all of the input divided by the duration of the input, for transcribing 2.23 hours of speech (rounded to 4 decimals); a worked example follows the table:
| RTF (process time / duration of audio) | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
|---|---|---|---|---|
| OpenAI | 0.2698 | 0.2443 | 0.3149 | 0.2273 |
| Huggingface (`transformers`) | 0.1629 | 0.1436 | 0.1746 | 0.1647 |
| faster-whisper | 0.0871 | 0.1680 | 0.0827 | 0.1799 |
| faster-whisper w/ batching | 0.0355 | 0.0663 | 0.0330 | 0.0633 |
| WhisperX* | 0.0864 | 0.1140 | 0.0823 | 0.1126 |
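As a sanity check, any cell can be reproduced from the timing table above. For example, for faster-whisper with batching, large-v3, float16:

```python
# faster-whisper w/ batching, large-v3, float16:
audio_duration_s = 2.23 * 3600      # 2.23 hours of speech ≈ 8028 s
process_time_s = 4 * 60 + 25        # 4m:25s to load + transcribe = 265 s

rtf = process_time_s / audio_duration_s
print(round(rtf, 4))                # 0.033, i.e. roughly 30x faster than real time
```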
Finally, a matrix with the maximum GPU memory consumption + maximum GPU power usage of each implementation (on average):
| Max. memory / Max. power | large-v2 with float16 | large-v2 with float32 | large-v3 with float16 | large-v3 with float32 |
|---|---|---|---|---|
| OpenAI | 10621 MiB / 240 W | 10639 MiB / 264 W | 10927 MiB / 238 W | 10941 MiB / 266 W |
| Huggingface (`transformers`)* | 15073 MiB / 141 W | 12981 MiB / 215 W | 14566 MiB / 123 W | 19385 MiB / 235 W |
| faster-whisper | 4287 MiB / 230 W | 7776 MiB / 263 W | 4292 MiB / 230 W | 7768 MiB / 262 W |
| faster-whisper w/ batching* | 5616 MiB / 243 W | 9893 MiB / 264 W | 5601 MiB / 242 W | 9877 MiB / 264 W |
| WhisperX* | 9947 MiB / 249 W | 13940 MiB / 252 W | 9944 MiB / 250 W | 14094 MiB / 254 W |
* For these implementations, batching is supported. Setting a higher `batch_size` will lead to faster inference at the cost of extra memory used.
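For reference, this is roughly how the batch size is exposed in the Huggingface `transformers` pipeline evaluated above; the audio path is a placeholder and the remaining options are illustrative rather than the exact evaluation configuration:

```python
import torch
from transformers import pipeline

# Chunk long audio into 30 s windows so they can be batched together;
# batch_size=2 matches the HF setting used in this evaluation.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    chunk_length_s=30,
)

# "broadcast_news.wav" is a placeholder path.
out = asr("broadcast_news.wav", batch_size=2, return_timestamps="word")
print(out["text"])
```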