whisper-large-v3-distil-multi4-v0.2 open-source model: multilingual speech recognition for English, French, Spanish, and German

Whisper Large V3 Distil Multi4 V0.2

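Developed by bofenghuang

A multilingual distilled Whisper model with two decoder layers that supports four European languages: English, French, Spanish, and German.

Speech recognition

Transformers

Multilingual. Open-source license: MIT. #multilingual speech recognition #code-switching #lightweight decoder

Downloads: 70

Released: 12/5/2024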

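Model Overview

This model is a distilled version of Whisper-large-v3. It targets automatic speech recognition in English, French, Spanish, and German, and it supports code-switching.

Model Features

Multilingual support

Recognizes speech in four European languages: English, French, Spanish, and German.

Code-switching

Detects the spoken language automatically and switches between languages, so a single transcribed segment can contain more than one language.

Distilled architecture

The model is compressed through distillation, which reduces compute requirements while aiming to stay close to the original model's accuracy (the evaluation tables show the measured gap).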

Model Capabilities

Multilingual speech recognition

Automatic language detection

Code-switching handling

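Use Cases

Speech transcription

Multilingual meeting minutes

Transcribes meeting recordings that contain several languages.

Recognizes segments in different languages and switches between them automatically.

Multilingual podcast transcription

Converts podcast content that contains several languages into text.

Identifies and labels passages spoken in different languages.

Voice assistants

Multilingual voice input

Supports voice input in which the user mixes several languages.

Handles language switches without a separate configuration step.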

🚀 Whisper-Large-V3-Distil-Multi4-v0.2

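This is a multilingual distilled Whisper model with two decoder layers. It supports four European languages: English, French, Spanish, and German.

The model was trained during the development of Distil-Large-v3.5.

A notable feature is native support for code-switching. When the model detects a change of language, it generates a new language token, so it can switch languages within a single transcribed segment (see the example below).

During training, the <|yue|> language token was repurposed as an automatic language detection token, which is what enables code-switching at inference time. To use this feature, set the language parameter to cantonese (this is the default).

The model performs below both the monolingual distilled versions and Whisper-Large-v3-Turbo. Future work should explore better training procedures and incorporate more data in order to compress multilingual capability into a single model effectively.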

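🚀 Quick Start

This section walks through the basic steps for running automatic speech recognition with the Whisper-Large-V3-Distil-Multi4-v0.2 model.

✨ Key Features

Multilingual: supports four European languages (English, French, Spanish, and German).
Code-switching: can switch languages automatically within a single transcribed segment.

📦 Installation

Install the required libraries before using the model. The following command installs transformers, torch, and datasets.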

pip install transformers torch datasets
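
After installation, a quick check that the three packages are importable can save a confusing failure later. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` is hypothetical and not part of the model card.

```python
from importlib.util import find_spec

# Packages named in the install command above.
REQUIRED = ("transformers", "torch", "datasets")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing packages:", ", ".join(missing) if missing else "none")
```

This only confirms that each package can be located; it does not verify versions or GPU support.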

💻 Usage Examples

Basic usage

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-distil-multi4-v0.2"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_name_or_path, torch_dtype=torch_dtype)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "cs", split="test")
sample, text = dataset[0]["audio"], dataset[0]["text"]

# Ground truth text
print(text)
# Aber sei ihnen nicht böse, Habibi, vergib ihnen, sie vergaßen die Liebe, sie vergaßen die Bibel, 
# wünsch ihnen den Frieden. Nous allons construire des radiotélescopes géants comme celui-ci, 
# qui est mon préféré. Questa è un'immagine di Cairo Open City, una mostra che il museo Folkwang di 
# Essen ha dedicato al ruolo della mobile photography nella primavera Araba.

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(device, dtype=torch_dtype),
    max_new_tokens=128,
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
#  Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden. Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré. Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.

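# Inspect the generated tokens, keeping special tokens to see where the language switches.
# The last reference sentence is Italian, which is outside the four supported languages;
# the model tags it <|es|> and renders it in Spanish.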
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=False)[0]
print(transcription)
# <|de|> Aber sei ihnen nicht böse, Habibi, vergib ihn. Sie vergaßen die Liebe, sie vergaßen die Liebe. 
# Wünsche ihnen dem Frieden.<|fr|> Nous allons construire des radiotelescopes géants, comme celui-ci qui 
# est mon préféré.<|es|> Esta es una imagen de Cairo Open City, una muestra que el Museo Folk Punk de Essen 
# ha dedicado al ruolo de la mobile fotografía en la primavera árabe.
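
The decoded string with special tokens can be post-processed into per-language segments. The following is a minimal sketch, not part of the model card: the helper name `split_by_language` and the `LANGUAGE_CODES` set are hypothetical, and it assumes language tokens appear in the decoded text as `<|en|>`, `<|fr|>`, `<|es|>`, or `<|de|>`, as in the example output. Any other special token is simply dropped.

```python
import re
from typing import List, Optional, Tuple

# Assumption: only these four language tokens are of interest for this model.
LANGUAGE_CODES = {"en", "fr", "es", "de"}
SPECIAL_TOKEN = re.compile(r"<\|([^|]*)\|>")


def split_by_language(decoded: str) -> List[Tuple[Optional[str], str]]:
    """Split decoded text into (language code, text) segments."""
    segments: List[Tuple[Optional[str], str]] = []
    current_lang: Optional[str] = None
    # With one capture group, re.split alternates: text, token name, text, ...
    for index, part in enumerate(SPECIAL_TOKEN.split(decoded)):
        if index % 2 == 1:
            # Odd positions hold token names; only language tokens change state.
            if part in LANGUAGE_CODES:
                current_lang = part
            continue
        text = part.strip()
        if not text:
            continue
        if segments and segments[-1][0] == current_lang:
            # Merge text that was separated by a non-language special token.
            segments[-1] = (current_lang, segments[-1][1] + " " + text)
        else:
            segments.append((current_lang, text))
    return segments


if __name__ == "__main__":
    example = "<|de|> Wünsche ihnen dem Frieden.<|fr|> Nous allons construire.<|es|> Esta es una imagen."
    for lang, text in split_by_language(example):
        print(lang, "|", text)
```

Text that appears before any language token is returned with `None` as its language, so the caller can decide how to handle it.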

📚 Documentation

Evaluation

This section reports evaluation results for Whisper-Large-V3-Distil-Multi4-v0.2 alongside related models.

English

Model	LIUM_tedlium	mcv17	voxpopuli	fleurs	kensho_spgispeech	librispeech-test_clean	librispeech-test_other	speechcolab_gigaspeech
openai/whisper-large-v3	10.58	10.13	8.93	5.72	2.95	1.87	3.58	10.07
openai/whisper-large-v3-turbo	10.20	11.74	11.78	6.13	2.95	1.98	3.94	10.11
distil-whisper/distil-large-v3	8.93	12.41	7.72	7.59	3.25	2.42	5.11	10.08
distil-whisper/distil-large-v3.5	8.65	11.07	7.54	6.74	2.86	2.28	4.94	9.84
bofenghuang/whisper-large-v3-distil-multi4-v0.2	8.88	11.33	7.60	6.97	3.03	2.51	5.24	10.12
bofenghuang/whisper-large-v3-distil-multi7-v0.2	9.36	11.32	7.65	7.02	2.99	2.46	5.24	10.06

French

Model	mcv17	mls	voxpopuli	mtedx	af_accented	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	10.98	4.69	11.15	8.67	7.51	5.4	9.87	8.97	9	8.01
openai/whisper-large-v3-turbo	12.41	5.1	12.21	9.87	8.37	5.48	10.12	9	8.49	8.39
bofenghuang/whisper_large_v3_distil_fr_v0.2	11.1	5	10.68	8.75	7.09	6.35	9.44	9.84	8.94	8.93
bofenghuang/whisper-large-v3-distil-multi4-v0.2	11.96	6.04	11.07	9.16	7.99	7.10	10.42	12.61	9.06	11.75
bofenghuang/whisper-large-v3-distil-multi7-v0.2	12.19	6.2	11.29	9.13	8.26	7.17	10.04	12.26	8.93	11.56
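
One crude way to summarize a table like the French one is an unweighted mean of each model's row. The sketch below copies the five French rows and computes that mean; the names `FRENCH_RESULTS` and `mean_scores` are hypothetical, and it assumes the figures are error rates where lower is better, since this section does not name the metric.

```python
from statistics import mean

# Rows copied from the French table, in the same column order as its header.
FRENCH_RESULTS = {
    "openai/whisper-large-v3": [10.98, 4.69, 11.15, 8.67, 7.51, 5.4, 9.87, 8.97, 9, 8.01],
    "openai/whisper-large-v3-turbo": [12.41, 5.1, 12.21, 9.87, 8.37, 5.48, 10.12, 9, 8.49, 8.39],
    "bofenghuang/whisper_large_v3_distil_fr_v0.2": [11.1, 5, 10.68, 8.75, 7.09, 6.35, 9.44, 9.84, 8.94, 8.93],
    "bofenghuang/whisper-large-v3-distil-multi4-v0.2": [11.96, 6.04, 11.07, 9.16, 7.99, 7.10, 10.42, 12.61, 9.06, 11.75],
    "bofenghuang/whisper-large-v3-distil-multi7-v0.2": [12.19, 6.2, 11.29, 9.13, 8.26, 7.17, 10.04, 12.26, 8.93, 11.56],
}


def mean_scores(results):
    """Return the unweighted mean of each model's scores across test sets."""
    return {model: mean(scores) for model, scores in results.items()}


if __name__ == "__main__":
    # Print models from lowest to highest mean score.
    for model, score in sorted(mean_scores(FRENCH_RESULTS).items(), key=lambda item: item[1]):
        print(f"{score:6.3f}  {model}")
```

An unweighted mean treats every test set equally regardless of its size or difficulty, so it is only a rough summary. The per-dataset columns remain the more informative comparison.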

Spanish

Model	mcv17	mls	voxpopuli	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	4.91	3.97	11.06	6.52	4.22	10.85	10.36	5.90	5.22
openai/whisper-large-v3-turbo	5.74	4.41	16.02	6.66	4.59	11.55	10.68	6.46	5.41
bofenghuang/whisper-large-v3-distil-multi4-v0.2	5.58	4.34	8.52	7.43	5.20	11.26	13.43	5.69	8.95
bofenghuang/whisper-large-v3-distil-multi7-v0.2	5.70	4.35	8.55	7.56	5.15	11.45	13.54	5.84	8.27

German

Model	mcv17	mls	voxpopuli	mtedx	fleurs	hf_dev_data_chunk30	hf_dev_data_sequential	mtedx_chunk30	mtedx_sequential
openai/whisper-large-v3	6.11	5.60	17.75	19.63	5.92	11.21	10.35	17.64	17.76
openai/whisper-large-v3-turbo	7.45	6.43	20.48	20.00	6.45	10.57	9.70	18.04	18.37
bofenghuang/whisper-large-v3-distil-multi4-v0.2	7.31	6.45	12.41	21.48	8.20	11.04	13.55	19.54	21.76
bofenghuang/whisper-large-v3-distil-multi7-v0.2	7.57	6.67	12.42	21.95	8.28	11.21	13.84	19.90	21.67