# 🚀 Whisper-Large-V3-Distil-Multi7-v0.2
A multilingual distilled Whisper model with 2 decoder layers, supporting seven European languages: English, French, Spanish, German, Italian, Portuguese, and Dutch. It was developed during the work on Distil-Large-v3.5.

A notable feature is its native support for code-switching: the model can switch languages within a single transcription segment, automatically emitting a new language token whenever it detects a language change, as shown in the example below.
During training, the `<|yue|>` language token was repurposed as an automatic language detection token, enabling code-switching at inference time. To use this feature, simply set the `language` parameter to `cantonese` (the default).
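When decoding with `skip_special_tokens=False`, the generated language tokens remain in the output and can be used to recover per-language segments. As a minimal sketch (the helper and the sample string below are hypothetical illustrations, assuming Whisper's standard `<|en|>`-style language tokens):

```python
import re

def split_by_language(raw_transcription: str) -> list[tuple[str, str]]:
    """Split a raw Whisper decoding into (language, text) segments
    based on the <|xx|> language tokens it contains."""
    # Capture language tokens such as <|en|>, <|fr|>, or <|yue|>
    parts = re.split(r"<\|([a-z]{2,3})\|>", raw_transcription)
    segments = []
    # parts alternates: [before_first_token, lang1, text1, lang2, text2, ...]
    for lang, text in zip(parts[1::2], parts[2::2]):
        text = re.sub(r"<\|[^|]+\|>", "", text).strip()  # drop other special tokens
        if text:
            segments.append((lang, text))
    return segments

# Hypothetical raw output illustrating a mid-segment language switch
raw = "<|en|> Hello everyone,<|fr|> bonjour à tous.<|endoftext|>"
print(split_by_language(raw))  # [('en', 'Hello everyone,'), ('fr', 'bonjour à tous.')]
```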
However, the model's performance still lags behind both the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should explore better training procedures and potentially incorporate more data to compress multilingual capabilities into a single model more effectively.
## 🚀 Quick Start

### 💻 Usage Examples

#### Basic Usage
```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

# Run on GPU with fp16 when available
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model and processor
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi7-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Load an example audio sample
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]
print(text)

# Extract input features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

# Generate the transcription
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Decode without special tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

# Decode with special tokens to inspect the generated language tokens
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
```
## 📚 Documentation

### 🔍 Evaluation

All reported values are word error rates (WER, in %; lower is better).
#### English

| Model | LIUM_tedlium | mcv17 | voxpopuli | fleurs | kensho_spgispeech | librispeech-test_clean | librispeech-test_other | speechcolab_gigaspeech |
|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 10.58 | 10.13 | 8.93 | 5.72 | 2.95 | 1.87 | 3.58 | 10.07 |
| openai/whisper-large-v3-turbo | 10.20 | 11.74 | 11.78 | 6.13 | 2.95 | 1.98 | 3.94 | 10.11 |
| distil-whisper/distil-large-v3 | 8.93 | 12.41 | 7.72 | 7.59 | 3.25 | 2.42 | 5.11 | 10.08 |
| distil-whisper/distil-large-v3.5 | 8.65 | 11.07 | 7.54 | 6.74 | 2.86 | 2.28 | 4.94 | 9.84 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 8.88 | 11.33 | 7.60 | 6.97 | 3.03 | 2.51 | 5.24 | 10.12 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 9.36 | 11.32 | 7.65 | 7.02 | 2.99 | 2.46 | 5.24 | 10.06 |
#### French

| Model | mcv17 | mls | voxpopuli | mtedx | af_accented | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 10.98 | 4.69 | 11.15 | 8.67 | 7.51 | 5.4 | 9.87 | 8.97 | 9 | 8.01 |
| openai/whisper-large-v3-turbo | 12.41 | 5.1 | 12.21 | 9.87 | 8.37 | 5.48 | 10.12 | 9 | 8.49 | 8.39 |
| bofenghuang/whisper_large_v3_distil_fr_v0.2 | 11.1 | 5 | 10.68 | 8.75 | 7.09 | 6.35 | 9.44 | 9.84 | 8.94 | 8.93 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 11.96 | 6.04 | 11.07 | 9.16 | 7.99 | 7.10 | 10.42 | 12.61 | 9.06 | 11.75 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 12.19 | 6.2 | 11.29 | 9.13 | 8.26 | 7.17 | 10.04 | 12.26 | 8.93 | 11.56 |
#### Spanish

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 4.91 | 3.97 | 11.06 | 6.52 | 4.22 | 10.85 | 10.36 | 5.90 | 5.22 |
| openai/whisper-large-v3-turbo | 5.74 | 4.41 | 16.02 | 6.66 | 4.59 | 11.55 | 10.68 | 6.46 | 5.41 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 5.58 | 4.34 | 8.52 | 7.43 | 5.20 | 11.26 | 13.43 | 5.69 | 8.95 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 5.70 | 4.35 | 8.55 | 7.56 | 5.15 | 11.45 | 13.54 | 5.84 | 8.27 |
#### German

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 6.11 | 5.60 | 17.75 | 19.63 | 5.92 | 11.21 | 10.35 | 17.64 | 17.76 |
| openai/whisper-large-v3-turbo | 7.45 | 6.43 | 20.48 | 20.00 | 6.45 | 10.57 | 9.70 | 18.04 | 18.37 |
| bofenghuang/whisper-large-v3-distil-multi4-v0.2 | 7.31 | 6.45 | 12.41 | 21.48 | 8.20 | 11.04 | 13.55 | 19.54 | 21.76 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 7.57 | 6.67 | 12.42 | 21.95 | 8.28 | 11.21 | 13.84 | 19.90 | 21.67 |
#### Italian

| Model | mcv17 | mls | voxpopuli | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 5.71 | 9.58 | 28.45 | 7.21 | 4.28 | 6.95 | 6.37 | 6.83 | 7.28 |
| openai/whisper-large-v3-turbo | 6.77 | 10.64 | 30.69 | 7.41 | 4.69 | 6.88 | 6.52 | 7.98 | 7.37 |
| bofenghuang/whisper_large_v3_distil_it_v0.2 | 6.15 | 9.22 | 17.27 | 7.52 | 5.26 | 6.06 | 6.99 | 7.84 | 8.42 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.78 | 11.42 | 17.53 | 8.07 | 5.68 | 7.04 | 9.51 | 7.51 | 10.47 |
#### Portuguese

| Model | mcv17 | mls | mtedx | fleurs | hf_dev_data_chunk30 | hf_dev_data_sequential | mtedx_chunk30 | mtedx_sequential |
|---|---|---|---|---|---|---|---|---|
| openai/whisper-large-v3 | 6.76 | 7.04 | 8.91 | 5.86 | 12.11 | 12.39 | 8.70 | 8.98 |
| openai/whisper-large-v3-turbo | 7.66 | 6.64 | 8.84 | 6.11 | 12.42 | 11.62 | 10.97 | 9.04 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 8.31 | 6.75 | 10.11 | 7.10 | 12.74 | 14.97 | 9.64 | 11.78 |
#### Dutch

| Model | mcv17 | mls | voxpopuli | fleurs |
|---|---|---|---|---|
| openai/whisper-large-v3 | 4.51 | 66.95 | 23.35 | 6.99 |
| openai/whisper-large-v3-turbo | 6.16 | 52.37 | 27.42 | 7.59 |
| bofenghuang/whisper-large-v3-distil-multi7-v0.2 | 6.76 | 14.82 | 14.92 | 10.86 |
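Word error rate itself is simply the word-level edit distance between a reference and a hypothesis, divided by the number of reference words. As a minimal self-contained sketch (not the exact evaluation script used for the tables above, which typically relies on a library such as `jiwer`):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words, one row at a time
    d = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, d[0] = d[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, d[j] = d[j], min(d[j] + 1,      # deletion
                                   d[j - 1] + 1,  # insertion
                                   prev + cost)   # substitution / match
    return d[len(hyp)] / len(ref)

print(wer("the cat sat down", "the cat sat down"))  # 0.0
print(wer("the cat sat down", "the dog sat down"))  # 0.25 (one substitution in four words)
```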
## 📄 License
This project is licensed under the MIT license.