distil-whisper-large-v3-es: an open-source Spanish speech recognition model offering free, high-accuracy Spanish transcription

Distil Whisper Large V3 Es

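Developed by marianbasti

A Spanish speech recognition model distilled from the Whisper large-v3 model, developed jointly by SandboxAI and Universidad Nacional de Rio Negro.

Speech recognition

Transformers

Spanish  Open-source license: MIT  #Spanish speech recognition #Distilled model #Long-form audio chunking

Downloads: 64

Release date: 1/26/2024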

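Model overview

A speech recognition model optimized for Spanish. It was obtained by distilling Whisper-large-v3 and improves inference speed while maintaining high accuracy.

Model features

Efficient inference

Processes long-form audio with a chunked algorithm that is about 9x faster than the sequential long-form algorithm proposed in the Whisper paper.

Speculative decoding support

Can act as the assistant model for Whisper in speculative decoding, giving roughly a 2x speedup.

Spanish optimization

Trained and optimized specifically for Spanish speech recognition.

Model capabilities

Spanish speech transcription

Long-form audio processing

Real-time speech recognition

Use cases

Speech transcription

Meeting minutes

Automatically converts recordings of Spanish-language meetings into written transcripts.

Media subtitle generation

Automatically generates subtitles for Spanish-language video content.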

🚀 distil-whisper-large-v3-es

This repository contains a distilled version of the Whisper large-v3 model, trained on the Mozilla Common Voice dataset v16.1. The model is the result of a collaboration between SandboxAI and Universidad Nacional de Rio Negro.

🚀 Quick start

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. This example also installs 🤗 Datasets in order to load a toy audio dataset from the Hugging Face Hub.

pip install --upgrade pip
pip install --upgrade transformers accelerate "datasets[audio]"

💻 Usage examples

Basic usage

Transcribing short audio files

The model can transcribe short audio files (under 30 seconds) using the pipeline class.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "marianbasti/distil-whisper-large-v3-es"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])

To transcribe a local audio file, pass the path to the audio file when calling the pipeline.

- result = pipe(sample)
+ result = pipe("audio.mp3")

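Transcribing long audio files

Distil-Whisper uses a chunked algorithm to transcribe long audio files (30 seconds or longer). In practice, this chunked long-form algorithm is 9x faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the Distil-Whisper paper).

To enable chunking, pass the chunk_length_s parameter to the pipeline. For Distil-Whisper, a chunk length of 15 seconds is optimal. To enable batching, pass the batch_size argument.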

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "marianbasti/distil-whisper-large-v3-es"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
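
To make the chunking idea concrete, here is a minimal sketch of how a long recording could be split into overlapping windows before each window is transcribed independently. It is an illustration only, not the pipeline's actual implementation: the helper name chunk_windows is hypothetical, and the one-sixth overlap on each side is an assumption chosen to mirror what the pipeline documents as its default stride.

```python
from typing import List, Optional, Tuple


def chunk_windows(
    total_s: float,
    chunk_length_s: float = 15.0,
    stride_s: Optional[float] = None,
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    if stride_s is None:
        # Assumption: overlap of one sixth of the chunk length on each side.
        stride_s = chunk_length_s / 6
    # Each new chunk advances by the chunk length minus the overlap on both sides.
    step = chunk_length_s - 2 * stride_s
    if step <= 0:
        raise ValueError("stride_s must be less than half of chunk_length_s")

    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_length_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break  # the last chunk reached the end of the audio
        start += step
    return windows


# A 40-second recording with 15-second chunks and 2.5 seconds of overlap per side.
print(chunk_windows(40.0, chunk_length_s=15.0, stride_s=2.5))
# [(0.0, 15.0), (10.0, 25.0), (20.0, 35.0), (30.0, 40.0)]
```

Because the windows are independent of each other, they can be batched together, which is where batch_size comes in. The overlapping regions give the merging step context on both sides of every chunk boundary.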

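Advanced usage

Speculative decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically guarantees exactly the same outputs as Whisper while running about 2x faster. Because the outputs are guaranteed to be identical, it is a drop-in replacement for existing Whisper pipelines.

In the following code snippet, the assistant Distil-Whisper model is loaded standalone, separately from the main Whisper pipeline. It is then specified as the "assistant model" for generation.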

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
assistant_model_id = "marianbasti/distil-whisper-large-v3-es"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)
model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
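
The identical-output guarantee can be illustrated with a toy sketch of greedy speculative decoding. This is not the Transformers implementation: the two "models" are hypothetical stand-in functions that map a token prefix to the next token, and the function names are made up for this example. The assistant drafts a few tokens, the main model checks them, and the first disagreement is replaced by the main model's own token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Plain greedy decoding: one main-model call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(
    main: NextToken,
    assistant: NextToken,
    prompt: List[int],
    max_new_tokens: int,
    lookahead: int = 4,
) -> Tuple[List[int], int]:
    """Return the decoded tokens and the number of draft/verify rounds."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    rounds = 0
    while len(tokens) < target_len:
        rounds += 1
        # 1. The cheap assistant drafts up to `lookahead` tokens.
        draft: List[int] = []
        for _ in range(min(lookahead, target_len - len(tokens))):
            draft.append(assistant(tokens + draft))
        # 2. The main model verifies the draft. A real implementation scores all
        #    draft positions in one forward pass; here it is called per position.
        for proposed in draft:
            expected = main(tokens)
            tokens.append(expected)  # always the main model's token
            if expected != proposed:
                break  # discard the rest of the draft after a mismatch
    return tokens, rounds


# Hypothetical toy models: the assistant agrees with the main model except
# when the prefix length is a multiple of 5.
def toy_main(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_assistant(prefix: List[int]) -> int:
    token = toy_main(prefix)
    return (token + 1) % 50 if len(prefix) % 5 == 0 else token


reference = greedy_decode(toy_main, [7], max_new_tokens=20)
speculative, rounds = speculative_decode(toy_main, toy_assistant, [7], max_new_tokens=20)
print(speculative == reference, rounds)
```

Every token that ends up in the output is the main model's own greedy choice, so the result matches plain greedy decoding regardless of how good the assistant is. The speedup comes from needing fewer verification rounds than generated tokens when the assistant's drafts are mostly accepted.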

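🔧 Technical details

Training

The model was trained for 60,000 optimization steps (about 1.47 epochs) on a single RTX 3090, which took roughly 60 hours. The training parameters are listed below.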

--teacher_model_name_or_path "openai/whisper-large-v3"
--train_dataset_name "mozilla-foundation/common_voice_16_1"
--train_dataset_config_name "es"
--train_split_name "train"
--text_column_name "sentence"
--eval_dataset_name "mozilla-foundation/common_voice_16_1"
--eval_dataset_config_name "es"
--eval_split_name "validation"
--eval_text_column_name "sentence"
--eval_steps 10000
--save_steps 10000
--warmup_steps 500
--learning_rate 1e-4
--lr_scheduler_type "linear"
--logging_steps 25
--save_total_limit 1
--max_steps 60000
--wer_threshold 10
--per_device_train_batch_size 8
--per_device_eval_batch_size 8
--dataloader_num_workers 12
--preprocessing_num_workers 12
--output_dir "./"
--do_train
--do_eval
--gradient_checkpointing
--predict_with_generate
--overwrite_output_dir
--use_pseudo_labels "false"
--freeze_encoder
--streaming False
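
As a quick sanity check on these numbers, the arithmetic below relates the step count, batch size, epoch count, and training time quoted above. It assumes a single device and no gradient accumulation, since no accumulation flag appears in the parameter list; the derived figures are estimates, not values reported by the authors.

```python
# Values taken from the training description and parameter list.
max_steps = 60_000
per_device_train_batch_size = 8
training_hours = 60
epochs = 1.47

# Assumptions: one GPU (a single RTX 3090) and no gradient accumulation.
num_devices = 1
gradient_accumulation_steps = 1

samples_seen = (
    max_steps * per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
samples_per_epoch = samples_seen / epochs
seconds_per_step = training_hours * 3600 / max_steps

print(f"samples seen:      {samples_seen:,}")
print(f"samples per epoch: {samples_per_epoch:,.0f} (estimate)")
print(f"seconds per step:  {seconds_per_step:.1f}")
```

Under these assumptions the run processed 480,000 training samples at about 3.6 seconds per step.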

Results

The distilled model achieves a WER of 5.11% (orthographic WER: 10.15%).
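
The gap between the two figures comes from text normalization: the orthographic WER compares the raw text, including casing and punctuation, while the plain WER is computed after normalizing both sides. The sketch below shows the idea with a word-level edit distance. The normalizer here is a deliberately simple, hypothetical stand-in (lowercasing and stripping punctuation), not the normalizer used to produce the numbers above.

```python
import unicodedata
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def simple_normalize(text: str) -> str:
    """Hypothetical minimal normalizer: lowercase and drop punctuation characters."""
    return "".join(
        ch for ch in text.lower() if not unicodedata.category(ch).startswith("P")
    )


reference = "Hola, ¿cómo estás?"
hypothesis = "hola cómo estás"

orthographic_wer = wer(reference, hypothesis)
normalized_wer = wer(simple_normalize(reference), simple_normalize(hypothesis))
print(orthographic_wer, normalized_wer)  # 1.0 0.0
```

In this example every word differs in casing or punctuation, so the orthographic WER is 100% even though the normalized transcript is a perfect match.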

📄 License

Distil-Whisper inherits the MIT license from OpenAI's Whisper model.

Citation

If you use this model, please cite the Distil-Whisper paper.

@misc{gandhi2023distilwhisper,
      title={Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling}, 
      author={Sanchit Gandhi and Patrick von Platen and Alexander M. Rush},
      year={2023},
      eprint={2311.00430},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}