# 🚀 Whisper-Large-V3-Distil-Multi7-v0.2
A multilingual distilled Whisper model with 2 decoder layers, supporting seven European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch. It was developed during the work on Distil-Large-v3.5.

A notable feature is its native support for code-switching: the model can switch languages within a single transcription segment, automatically emitting a new language token whenever it detects a language change, as shown in the example below.
During training, the `<|yue|>` language token was repurposed as an automatic language detection token, enabling code-switching at inference time. To use this feature, simply set the `language` parameter to `cantonese` (the default).
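When decoding with `skip_special_tokens=False`, the generated language tokens remain in the output and can be used to recover per-language segments. As a minimal sketch (the helper and the sample string below are hypothetical illustrations, assuming Whisper's standard `<|en|>`-style language tokens):

```python
import re

def split_by_language(raw_transcription: str) -> list[tuple[str, str]]:
    """Split a raw Whisper decoding into (language, text) segments
    based on the <|xx|> language tokens it contains."""
    # Capture language tokens such as <|en|>, <|fr|>, or <|yue|>
    parts = re.split(r"<\|([a-z]{2,3})\|>", raw_transcription)
    segments = []
    # parts alternates: [before_first_token, lang1, text1, lang2, text2, ...]
    for lang, text in zip(parts[1::2], parts[2::2]):
        text = re.sub(r"<\|[^|]+\|>", "", text).strip()  # drop other special tokens
        if text:
            segments.append((lang, text))
    return segments

# Hypothetical raw output illustrating a mid-segment language switch
raw = "<|en|> Hello everyone,<|fr|> bonjour à tous.<|endoftext|>"
print(split_by_language(raw))  # [('en', 'Hello everyone,'), ('fr', 'bonjour à tous.')]
```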
However, the model's performance still lags behind both the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should explore better training procedures and potentially incorporate more data to compress multilingual capabilities into a single model more effectively.
## 🚀 Quick Start

### 💻 Usage Examples

#### Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Run on GPU with fp16 when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Load an example audio sample
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]
print(text)

# Extract input features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate the transcription
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Decode without special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

# Decode with special tokens to inspect the generated language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
```
## 📚 Documentation

### 🔍 Evaluation

All reported values are word error rates (WER, in %; lower is better).
#### English

| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |
#### French

| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |
#### Spanish

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |
#### German

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |
#### Italian

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 5.71 | 9.58 | 28.45 | 7.21 | 4.28 | 6.95 | 6.37 | 6.83 | 7.28 |
| openai/whisper-large-v3-turbo | 6.77 | 10.64 | 30.69 | 7.41 | 4.69 | 6.88 | 6.52 | 7.98 | 7.37 |
| bofenghuang/whisper_large_v3_distil_it_v0.2 | 6.15 | 9.22 | 17.27 | 7.52 | 5.26 | 6.06 | 6.99 | 7.84 | 8.42 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.78 | 11.42 | 17.53 | 8.07 | 5.68 | 7.04 | 9.51 | 7.51 | 10.47 |
#### Portuguese

| Model | mcv17 | mls | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 6.76 | 7.04 | 8.91 | 5.86 | 12.11 | 12.39 | 8.70 | 8.98 |
| openai/whisper-large-v3-turbo | 7.66 | 6.64 | 8.84 | 6.11 | 12.42 | 11.62 | 10.97 | 9.04 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 8.31 | 6.75 | 10.11 | 7.10 | 12.74 | 14.97 | 9.64 | 11.78 |
#### Dutch

| Model | mcv17 | mls | voxpopuli | fleurs |
|---|---|---|---|---|
| openai/whisper-large-v3 | 4.51 | 66.95 | 23.35 | 6.99 |
| openai/whisper-large-v3-turbo | 6.16 | 52.37 | 27.42 | 7.59 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.76 | 14.82 | 14.92 | 10.86 |
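Word error rate itself is simply the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. As a minimal self-contained sketch (not the exact evaluation script used for the tables above, which typically relies on a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,      # deletion
                                   d[j - 1] + 1,  # insertion
                                   prev + cost)   # substitution / match
    return d[len(hyp)] / len(ref)

print(wer("the cat sat down", "the cat sat down"))  # 0.0
print(wer("the cat sat down", "the dog sat down"))  # 0.25 (one substitution in four words)
```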
## 📄 License
This project is licensed under the MIT license.