Jasmin - Read speech (comp_q)
Preprocessing
The transcriptions in the dataset are encoded as latin_1. For the evaluation tool to work, I converted them to UTF-8.
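The conversion can be sketched as below. The directory layout and the .ort extension are assumptions for illustration; adjust them to the actual corpus paths.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode a latin_1 transcription file as UTF-8."""
    text = src.read_text(encoding="latin_1")
    dst.write_text(text, encoding="utf-8")


# Hypothetical layout: one transcription file per recording.
src_dir = Path("transcriptions")
if src_dir.is_dir():
    for path in src_dir.glob("*.ort"):
        convert_to_utf8(path, path.with_suffix(".utf8.ort"))
```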
For the Flemish subset, one speaker (V000055) does not belong to any of the five speaker groups according to the metadata. This speaker and their files (fv160041 and fv170041) have therefore been excluded from the evaluation.
Postprocessing
A large number of insertions were encountered when evaluating Whisper, caused by time misalignment at the start of segments. This was addressed by adjusting the start_time of the first word of each segment to end_time - 0.1s.
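A minimal sketch of this adjustment is shown below. The segment schema (a list of dicts with a "words" list holding "start_time"/"end_time" in seconds) is a hypothetical representation, not the tool's actual data structure.

```python
def fix_first_word_start(segments, margin=0.1):
    """Clamp the first word of each segment so its start_time lies
    `margin` seconds before its end_time.

    `segments`: hypothetical schema - a list of dicts, each with a
    "words" list of {"start_time": float, "end_time": float} entries.
    """
    for seg in segments:
        words = seg.get("words")
        if words:
            first = words[0]
            # Never produce a negative timestamp.
            first["start_time"] = max(0.0, first["end_time"] - margin)
    return segments
```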
The same issue as described here was also encountered for Jasmin, where Whisper v3 would sometimes output segments such as “Beeld &”. It was addressed with the same method. Normalization and variations have also been applied, similarly to N-Best.
Jasmin - HMI speech (comp_p)
The same steps were applied as for comp_q. One additional step is described below:
Preprocessing
The machine voice is still audible in the human speaker’s channel. To address this, a script was used to silence the time spans where the human is not speaking. The gaps between human-spoken segments are annotated with inter_segment_gap as an identifier, and those parts were silenced using pydub’s AudioSegment.silent() function. The script also converts the audio to mono by keeping only the human channel.