Distil-large-v2 open-source speech recognition model - 6 times faster, 49% smaller, within 1% WER

Distil Large V2

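Developed by distil-whisper

Distil-Whisper is a distilled version of the Whisper model that is 6 times faster and 49% smaller, and performs within 1% WER of Whisper on out-of-distribution evaluation sets.

Speech recognition | English | Open source | License: MIT
Tags: English speech recognition, efficient inference, long-form audio processing

Downloads: 42.65k

Release date: 10/24/2023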

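Model overview

Distil-Whisper is a distilled version of the Whisper model, optimized for English speech recognition, that provides efficient automatic speech recognition (ASR).

Model features

Efficient inference

6 times faster than the original Whisper model, which makes it suitable for real-time applications.

Reduced size

The model is 49% smaller, which lowers memory usage.

High accuracy

On out-of-distribution evaluation sets, performance is within 1% WER of the original model.

Long-form transcription support

Supports long-form audio through a chunked algorithm that is 9 times faster than the sequential algorithm.

Model capabilities

English speech recognition

Short-form audio transcription

Long-form audio transcription

Speculative decoding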

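Use cases

Audio transcription

Meeting minutes

Converts meeting recordings into written records.

Podcast transcription

Converts podcast content into text for search and archiving.

Assistive technology

Real-time captioning

Generates real-time captions for videos and live streams.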

🚀 Distil-Whisper: distil-large-v2

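Distil-Whisper was proposed in the paper Robust Knowledge Distillation via Large-Scale Pseudo Labelling.

It is a distilled version of the Whisper model that is 6 times faster, 49% smaller, and performs within 1% WER on out-of-distribution evaluation sets. This is the repository for distil-large-v2, a distilled variant of Whisper large-v2.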
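
Since the comparison above is stated in terms of WER (word error rate), a minimal sketch of how WER is commonly computed may help: the word-level edit distance between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions; published evaluations typically also normalize text (casing, punctuation) before scoring, which this sketch only approximates with lowercasing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "ask not what your country can do for you"
hypothesis = "ask not what the country can do"
print(f"WER = {word_error_rate(reference, hypothesis):.3f}")
```

In this example, one substitution and two deletions against nine reference words give a WER of about 0.333.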

Model	Params / M	Rel. Latency ↑	Short-Form WER ↓	Long-Form WER ↓
large-v3	1550	1.0	8.4	11.0
large-v2	1550	1.0	9.1	11.7
distil-large-v3	756	6.3	9.7	10.8
distil-large-v2	756	5.8	10.1	11.6
distil-medium.en	394	6.8	11.1	12.4
distil-small.en	166	5.6	12.1	12.8
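
The headline figures can be cross-checked against the table with a few lines of arithmetic. The sketch below copies the large-v2 and distil-large-v2 rows from the table (the dictionary layout and variable names are assumptions made for illustration) and derives the parameter ratio and the WER gaps.

```python
# Rows copied from the table above: (params in millions, relative latency,
# short-form WER, long-form WER).
rows = {
    "large-v2": (1550, 1.0, 9.1, 11.7),
    "distil-large-v2": (756, 5.8, 10.1, 11.6),
}

teacher = rows["large-v2"]
student = rows["distil-large-v2"]

param_ratio = student[0] / teacher[0]          # fraction of parameters kept
speedup = student[1] / teacher[1]              # relative latency, higher is faster
short_gap = round(student[2] - teacher[2], 1)  # WER difference, short-form
long_gap = round(student[3] - teacher[3], 1)   # WER difference, long-form

print(f"parameters kept: {param_ratio:.1%}")
print(f"speed-up: {speedup:.1f}x")
print(f"short-form WER gap: {short_gap:+.1f}")
print(f"long-form WER gap: {long_gap:+.1f}")
```

Read this way, distil-large-v2 keeps just under half of large-v2's parameters, runs 5.8 times faster, and stays within about one WER point on short-form audio while scoring marginally better on long-form audio.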

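⚠️ Important

Following the release of OpenAI's Whisper large-v3, an updated distil-large-v3 model has been published. distil-large-v3 surpasses the performance of distil-large-v2, with no architecture changes and better support for sequential long-form generation. It is therefore recommended to use distil-large-v3 in place of this distil-large-v2 model.

💡 Usage note

Distil-Whisper is currently only available for English speech recognition. Work on distilling Whisper for other languages is ongoing. If you are interested in distilling Whisper in your language, check out the provided training code. Multilingual checkpoints will be added to the Distil-Whisper repository when ready!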

🚀 Quick start

Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. This example also installs 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub.

pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
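
Because support starts at Transformers 4.35, it can be useful to verify the installed version before loading the model. The helper below is a minimal sketch using only the standard library; the function name `meets_minimum` and the parsing approach (comparing only the major and minor numbers) are assumptions for illustration, not part of the Transformers API.

```python
import re
from importlib import metadata


def meets_minimum(installed: str, minimum: str = "4.35") -> bool:
    """Compare the (major, minor) parts of two version strings."""
    def major_minor(version: str) -> tuple:
        numbers = [int(n) for n in re.findall(r"\d+", version)[:2]]
        # Pad so that a bare "4" is treated as "4.0".
        return tuple(numbers + [0] * (2 - len(numbers)))
    return major_minor(installed) >= major_minor(minimum)


try:
    installed = metadata.version("transformers")
    print(installed, "OK" if meets_minimum(installed) else "too old for Distil-Whisper")
except metadata.PackageNotFoundError:
    print("transformers is not installed")
```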

💻 Usage examples

Basic usage

Short-form transcription

The model can be used with the pipeline class to transcribe short-form audio files (under 30 seconds) as follows.

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

To transcribe a local audio file, simply pass the path to the audio file when calling the pipeline.

- result = pipe(sample)
+ result = pipe("audio.mp3")

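Long-form transcription

Distil-Whisper uses a chunked algorithm to transcribe long-form audio files (longer than 30 seconds). In practice, this chunked long-form algorithm is 9 times faster than the sequential algorithm proposed by OpenAI in the Whisper paper (see Table 7 of the Distil-Whisper paper).

To enable chunking, pass the chunk_length_s parameter to the pipeline. For Distil-Whisper, a chunk length of 15 seconds is optimal. To enable batching, pass the batch_size argument.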

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "distil-whisper/distil-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=15,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
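
To make the chunked long-form algorithm more concrete, the sketch below computes the sample ranges that 15-second chunks would cover for a 16 kHz recording. It is an illustration only, not the pipeline's actual implementation: the function name `chunk_bounds`, the 2.5-second overlap on each side of a chunk, and the 16 kHz sampling rate are assumptions chosen for the example. Overlap between neighbouring chunks is what allows transcripts to be stitched together at chunk borders, and the independence of the chunks is what makes batching with batch_size possible.

```python
from typing import List, Tuple


def chunk_bounds(
    num_samples: int,
    sampling_rate: int = 16_000,
    chunk_length_s: float = 15.0,
    stride_length_s: float = 2.5,  # assumed overlap on each side of a chunk
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping chunks."""
    chunk_len = int(chunk_length_s * sampling_rate)
    stride = int(stride_length_s * sampling_rate)
    step = chunk_len - 2 * stride  # new audio covered by each chunk
    if step <= 0:
        raise ValueError("stride must be less than half the chunk length")
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# A 40-second recording at 16 kHz.
bounds = chunk_bounds(40 * 16_000)
for start, end in bounds:
    print(f"{start / 16_000:5.1f}s -> {end / 16_000:5.1f}s")

batch_size = 16
num_batches = -(-len(bounds) // batch_size)  # ceiling division
print(f"{len(bounds)} chunks, {num_batches} batch(es) of up to {batch_size}")
```

With these assumed settings, a 40-second recording is covered by four overlapping chunks that fit into a single batch of 16.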

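Speculative decoding

Distil-Whisper can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically guarantees exactly the same outputs as Whisper while being 2 times faster. Because the outputs are identical, it is a drop-in replacement for existing Whisper pipelines.

In the following code snippet, the assistant Distil-Whisper model is loaded separately from the main Whisper pipeline and then specified as the "assistant model" for generation.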

from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

assistant_model_id = "distil-whisper/distil-large-v2"

assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
assistant_model.to(device)

model_id = "openai/whisper-large-v2"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
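
The guarantee of identical outputs follows from how speculative decoding verifies the assistant's guesses. The toy sketch below illustrates the idea with greedy decoding over integer tokens: a cheap draft function proposes a few tokens, the main function checks them, and the first disagreement is replaced by the main model's own token. Everything here (the `target` and `draft` functions, the block size `k`, the token arithmetic) is a made-up stand-in for illustration and is unrelated to the Transformers implementation. In a real system the speed-up comes from the main model checking all proposed tokens in a single forward pass, which this sketch only counts as one verification round.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target(seq: List[int]) -> int:
    """Stand-in for the large model: a deterministic next-token rule."""
    return (3 * seq[-1] + 1) % 7


def draft(seq: List[int]) -> int:
    """Stand-in for the assistant: agrees with `target` except after token 5."""
    return 0 if seq[-1] == 5 else target(seq)


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    main: NextToken, assistant: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Greedy speculative decoding; returns the sequence and the number of rounds."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    rounds = 0
    while len(seq) < goal:
        # 1. The assistant proposes up to k tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(assistant(seq + proposal))
        # 2. The main model verifies them (one round stands for one forward pass).
        rounds += 1
        for guess in proposal:
            verified = main(seq)
            seq.append(verified)  # always the main model's token
            if verified != guess:
                break  # discard the rest of the proposal
    return seq, rounds


baseline = greedy_decode(target, [1], 12)
speculative, rounds = speculative_decode(target, draft, [1], 12)
print(speculative == baseline, rounds)
```

The speculative output matches plain greedy decoding with the main function token for token, while needing 4 verification rounds instead of 12 sequential steps.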

Advanced usage

Additional speed and memory improvements

The following speed and memory improvements can be applied to Distil-Whisper.

Flash Attention

If your GPU supports it, Flash-Attention 2 is recommended. To use it, first install Flash Attention.

pip install flash-attn --no-build-isolation

Then pass use_flash_attention_2=True to from_pretrained.

- model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True, use_flash_attention_2=True)

Torch Scaled Dot-Product Attention (SDPA)

If your GPU does not support Flash Attention, BetterTransformer is recommended. To use it, first install optimum.

pip install --upgrade optimum

Then convert the model to a "BetterTransformer" model before using it.

model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
+ model = model.to_bettertransformer()

Running Distil-Whisper in `openai-whisper`

To use the model in the original Whisper format, first ensure the openai-whisper package is installed.

pip install --upgrade openai-whisper

The following code snippet shows how to transcribe a sample file from the LibriSpeech dataset loaded with 🤗 Datasets.

import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from whisper import load_model, transcribe

distil_large_v2 = hf_hub_download(repo_id="distil-whisper/distil-large-v2", filename="original-model.bin")
model = load_model(distil_large_v2)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]["array"]
sample = torch.from_numpy(sample).float()

pred_out = transcribe(model, audio=sample)
print(pred_out["text"])

To transcribe a local audio file, simply pass the path to the audio file as the audio argument to transcribe.

pred_out = transcribe(model, audio="audio.mp3")

Whisper.cpp

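Distil-Whisper can be run from the Whisper.cpp repository with the original sequential long-form transcription algorithm. In a provisional benchmark on a Mac M1, distil-large-v2 is 2 times faster than large-v2 while performing to within 0.1% WER on long-form audio.

Future Distil-Whisper releases will target faster CPU inference. By distilling a smaller encoder, the aim is to achieve speed-ups similar to those obtained on GPU.

Steps for getting started:

Clone the Whisper.cpp repository.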

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

Download the ggml weights for distil-large-v2 from the Hugging Face Hub.

python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='distil-whisper/distil-large-v2', filename='ggml-large-32-2.en.bin', local_dir='./models')"

If the huggingface_hub package is not installed, the weights can also be downloaded with wget.

wget https://huggingface.co/distil-whisper/distil-large-v2/resolve/main/ggml-large-32-2.en.bin -P ./models

Run inference using the provided sample audio.

make -j && ./main -m models/ggml-large-32-2.en.bin -f samples/jfk.wav

Transformers.js

import { pipeline } from '@huggingface/transformers';

const transcriber = await pipeline('automatic-speech-recognition', 'distil-whisper/distil-large-v2');

const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: " And so, my fellow Americans, ask not what your country can do for you. Ask what you can do for your country." }

See the documentation for more information.

Note: Due to the large model size, it is recommended to run this model server-side with Node.js rather than in the browser.

Candle

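Through an integration with Hugging Face Candle 🕯️, Distil-Whisper is now available in the Rust library 🦀.

Benefits:

Optimized CPU backend with optional MKL support for x86 and Accelerate for Macs
CUDA backend for efficient GPU execution, with multi-GPU distribution via NCCL
WASM support: run Distil-Whisper in a browser

Steps for getting started:

Install candle-core as described in the Candle installation instructions.
Clone the candle repository locally.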

git clone https://github.com/huggingface/candle.git

Change to the directory of the Whisper example.

cd candle/candle-examples/examples/whisper

Run the example.

cargo run --example whisper --release -- --model distil-large-v2

To specify your own audio file, add the --input flag.

cargo run --example whisper --release -- --model distil-large-v2 --input audio.wav

8-bit and 4-bit quantization

Coming soon...

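🔧 Technical details

Distil-Whisper inherits the encoder-decoder architecture from Whisper. The encoder maps a sequence of speech vector inputs to a sequence of hidden-state vectors. The decoder auto-regressively predicts text tokens, conditional on all previous tokens and the encoder hidden states.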
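
As a rough illustration of the encoder's input and output sequences, the arithmetic below traces a 30-second clip through Whisper-style preprocessing. The constants (16 kHz audio, a 10 ms hop between spectrogram frames, and a 2x temporal downsampling in the encoder's convolutional stem) are assumptions taken from the commonly published Whisper configuration rather than from this page, so treat the numbers as indicative.

```python
SAMPLING_RATE = 16_000   # Hz (assumed)
HOP_LENGTH = 160         # samples between spectrogram frames, i.e. 10 ms (assumed)
CONV_DOWNSAMPLING = 2    # temporal stride of the encoder's conv stem (assumed)
CLIP_SECONDS = 30        # length of one input window

num_samples = SAMPLING_RATE * CLIP_SECONDS
num_frames = num_samples // HOP_LENGTH               # encoder input length
num_hidden_states = num_frames // CONV_DOWNSAMPLING  # encoder output length

print(f"{num_samples} samples -> {num_frames} input frames -> {num_hidden_states} hidden states")
```

Under these assumptions the decoder attends to 1500 encoder hidden states per 30-second window while it generates text tokens one at a time.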