faster-whisper-large-v3-french-distil-dec16: an open-source model with optimized inference for efficient French speech recognition


Faster Whisper Large V3 French Distil Dec16

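Developed by brandenkmurray

A French distilled version of Whisper-Large-V3 that reduces the number of decoder layers to improve inference efficiency while maintaining good accuracy.

Speech recognition

Transformers

French · Open-source license: MIT · #French speech recognition #Distilled model #Low word error rate

Downloads: 25

Release date: 6/28/2024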

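Model overview

This model is a French-only distilled version of Whisper-Large-V3. Reducing the decoder from 32 layers to 16 improves inference efficiency while maintaining good recognition accuracy. The model is available for several inference frameworks and is suited to French speech recognition tasks.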

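Model features

Efficient inference

Fewer decoder layers substantially reduce memory usage and inference time

Multi-framework support

Converted formats are provided for several frameworks, including transformers, openai-whisper, and faster-whisper

Speculative decoding compatible

Can be combined with the original Whisper model for speculative decoding, further increasing inference speed

Long-form processing

Handles long audio inputs effectively through chunked processing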

Model capabilities

French speech recognition

Long-form audio transcription

Real-time speech transcription

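Use cases

Speech transcription

French meeting minutes

Convert recordings of French meetings into written records

Word error rate of 3.57% to 8.76% (depending on the test dataset)

Subtitle generation for French media content

Automatically generate subtitles for French video content

Speech analytics

Call-center speech analytics

Analyze the content of French call-center conversations

Performs well on noisy audio with domain-specific terminology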

🚀 Whisper-Large-V3-French-Distil-Dec16

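Whisper-Large-V3-French-Distil is a series of distilled versions of Whisper-Large-V3-French. They are obtained by reducing the number of decoder layers from 32 to 16, 8, 4, or 2 and distilling on a large-scale dataset. The approach is detailed in the referenced paper.

The distilled variants reduce memory usage and inference time while maintaining performance (depending on the number of retained layers) and mitigating the risk of hallucinations, particularly in long-form transcription. They can also be combined with the original Whisper-Large-V3-French model for speculative decoding, which yields faster inference and output consistent with using the main model alone.

The model has been converted into various formats so that it can be used across different libraries, including transformers, openai-whisper, faster-whisper, whisper.cpp, candle, and mlx.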

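✨ Performance

We evaluated the model on both short-form and long-form transcription, and tested it on both in-distribution and out-of-distribution datasets, to analyze its accuracy, generalizability, and robustness comprehensively.

Note that the reported WER is computed after converting numbers to text, removing punctuation (except apostrophes and hyphens), and converting all characters to lowercase.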
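The normalization and scoring described above can be sketched in plain Python. This is a minimal illustration, not the evaluation script used for the reported numbers: the helper names `normalize` and `wer` are hypothetical, and the number-to-text step is deliberately left out because it needs a language-specific converter.

```python
def normalize(text: str) -> str:
    """Lowercase and drop punctuation, keeping apostrophes and hyphens.

    Assumption: number-to-text conversion is skipped in this sketch.
    """
    text = text.lower().replace("\u2019", "'")  # unify curly apostrophes
    kept = []
    for ch in text:
        if ch.isalnum() or ch.isspace() or ch in "'-":
            kept.append(ch)
        else:
            kept.append(" ")  # punctuation becomes a word boundary
    return " ".join("".join(kept).split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)
```

With this normalization, differences in casing or punctuation do not count as errors; only substituted, inserted, or deleted words do.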

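All evaluation results on the public datasets can be found here.

Short-form transcription

[Figure: eval-short-form]

Because readily available out-of-domain (OOD) and long-form test sets are lacking for French, we evaluated the model on internal test sets from Zaion Lab. These sets consist of human-annotated audio-transcription pairs from call-center conversations, characterized by significant background noise and domain-specific terminology.

Long-form transcription

[Figure: eval-long-form]

Long-form transcription was run with the 🤗 Hugging Face pipeline for faster evaluation. Audio files were split into 30-second chunks and processed in parallel.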

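💻 Usage

Hugging Face pipeline

The model can easily be used for audio transcription with the 🤗 Hugging Face pipeline class.

For long-form transcription (over 30 seconds), enable chunked processing by passing the chunk_length_s argument. This approach splits the audio into smaller segments, processes them in parallel, and joins them at the strides by finding the longest common sequence. Compared with OpenAI's sequential algorithm, the chunked long-form approach may slightly reduce accuracy, but it offers 9x faster inference.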

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec16"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    # chunk_length_s=30,  # for long-form transcription
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])
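To make the chunked long-form approach more concrete, here is a rough pure-Python sketch of the two ideas involved: overlapping windows and joining neighbouring transcripts on their overlap. The function names are hypothetical, the per-side stride of `chunk_length_s / 6` is an assumption mirroring the pipeline's documented default, and `merge_overlap` is a simplified exact-match variant rather than the library's actual merging routine.

```python
def chunk_windows(total_s, chunk_length_s=30.0, stride_length_s=None):
    """Return (start, end, left_stride, right_stride) tuples in seconds."""
    if stride_length_s is None:
        stride_length_s = chunk_length_s / 6  # assumed default overlap per side
    step = chunk_length_s - 2 * stride_length_s
    if step <= 0:
        raise ValueError("stride is too large for the chunk length")
    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_length_s, total_s)
        is_last = end >= total_s
        left = 0.0 if start == 0.0 else stride_length_s   # no left overlap on first chunk
        right = 0.0 if is_last else stride_length_s       # no right overlap on last chunk
        windows.append((start, end, left, right))
        if is_last:
            return windows
        start += step


def merge_overlap(left_tokens, right_tokens):
    """Join two token lists on their longest exact suffix/prefix overlap."""
    max_len = min(len(left_tokens), len(right_tokens))
    for size in range(max_len, 0, -1):
        if left_tokens[-size:] == right_tokens[:size]:
            return left_tokens + right_tokens[size:]
    return left_tokens + right_tokens  # no overlap found: plain concatenation
```

Because every window is independent, the chunks can be batched and decoded in parallel, which is where the speed-up over the sequential algorithm comes from.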

Hugging Face low-level APIs

You can also transcribe with the 🤗 Hugging Face low-level APIs, which give finer control over the process, as shown below.

import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec16"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Extract features
input_features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features


# Generate tokens
predicted_ids = model.generate(
    input_features.to(dtype=torch_dtype).to(device), max_new_tokens=128
)

# Detokenize to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

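Speculative decoding

Speculative decoding can be achieved with a draft model, essentially a distilled version of Whisper. This approach guarantees the same output as using the main Whisper model alone, offers 2x faster inference, and incurs only a slight increase in memory overhead.

Since the distilled Whisper has the same encoder as the original, only its decoder needs to be loaded at inference time, and the encoder outputs are shared between the main and draft models.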
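As a toy illustration of why the output is identical to that of the main model, here is a minimal sketch of greedy speculative decoding. `main_next` and `draft_next` are hypothetical stand-ins that map a token prefix to the next token; they are not Whisper models and not part of any library API.

```python
def greedy_decode(next_token, prompt, max_new_tokens):
    """Reference: plain greedy decoding with a single model."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


def speculative_decode(main_next, draft_next, prompt, max_new_tokens, k=4):
    """The draft proposes up to k tokens; the main model verifies them."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        n = min(k, target_len - len(tokens))
        # 1) The cheap draft model proposes n tokens autoregressively.
        draft = []
        for _ in range(n):
            draft.append(draft_next(tokens + draft))
        # 2) The main model checks each position. A real implementation does
        #    this in one batched forward pass, which is the source of the speed-up.
        accepted = []
        for i in range(n):
            expected = main_next(tokens + draft[:i])
            accepted.append(expected)
            if draft[i] != expected:
                break  # keep the main model's token, discard the rest of the draft
        tokens.extend(accepted)
    return tokens


def main_next(tokens):
    """Toy 'main model': a deterministic function of the prefix."""
    return (3 * tokens[-1] + len(tokens)) % 11


def draft_next(tokens):
    """Toy 'draft model': agrees with the main model most of the time."""
    guess = main_next(tokens)
    return guess if len(tokens) % 3 else (guess + 1) % 11
```

Every emitted token is one the main model would have chosen itself, so the draft model only affects how many tokens are accepted per verification step, never the final text.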

Using speculative decoding with the Hugging Face pipeline is simple: just specify the assistant_model in the generation configuration.

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSpeechSeq2Seq,
    AutoProcessor,
    pipeline,
)

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load model
model_name_or_path = "bofenghuang/whisper-large-v3-french"
processor = AutoProcessor.from_pretrained(model_name_or_path)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
model.to(device)

# Load draft model
assistant_model_name_or_path = "bofenghuang/whisper-large-v3-french-distil-dec2"
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name_or_path,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
)
assistant_model.to(device)

# Init pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    feature_extractor=processor.feature_extractor,
    tokenizer=processor.tokenizer,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={"assistant_model": assistant_model},
    max_new_tokens=128,
)

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]

# Run pipeline
result = pipe(sample)
print(result["text"])

OpenAI Whisper

You can also use the sequential long-form decoding algorithm with a sliding window and temperature fallback, as described by OpenAI in the original paper.

First, install the openai-whisper package.

pip install -U openai-whisper

Next, download the converted model.

python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='bofenghuang/whisper-large-v3-french-distil-dec16', filename='original_model.pt', local_dir='./models/whisper-large-v3-french-distil-dec16')"

Now you can transcribe audio files by following the usage instructions in the repository.

import whisper
from datasets import load_dataset

# Load model
model = whisper.load_model("./models/whisper-large-v3-french-distil-dec16/original_model.pt")

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

# Transcribe
result = model.transcribe(sample, language="fr")
print(result["text"])
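The temperature fallback mentioned at the start of this section can be sketched as follows. This is a simplified illustration, not the openai-whisper implementation: `decode_fn` is a hypothetical callable that takes a temperature and returns `(text, avg_logprob)`, and the default temperatures and thresholds are assumed to match the library's documented `transcribe` defaults.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Highly repetitive text compresses well, giving a high ratio."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,
    logprob_threshold=-1.0,
):
    """Retry decoding at higher temperatures until the result looks sane.

    Returns (text, avg_logprob, temperature) for the first acceptable
    attempt, or for the last attempt if none passes the checks.
    """
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_uncertain = avg_logprob < logprob_threshold
        if not (too_repetitive or too_uncertain):
            break  # accept this attempt
    return result
```

The idea is that deterministic decoding (temperature 0) is tried first, and sampling at higher temperatures is used only when the output looks like a repetition loop or has low confidence.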

Faster Whisper

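Faster Whisper is a reimplementation of OpenAI's Whisper models and the sequential long-form decoding algorithm in the CTranslate2 format.

Compared with openai-whisper, it offers up to 4x faster inference with lower memory usage. The model can additionally be quantized to int8, further improving efficiency on both CPU and GPU.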
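For intuition about what int8 quantization does to the weights, here is a generic sketch of symmetric per-tensor quantization in plain Python. It is only an illustration of the general technique; the function names are hypothetical, and the exact scheme CTranslate2 applies internally may differ.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the int8 representation."""
    return [q * scale for q in quantized]
```

Each weight is then stored in one byte instead of two (float16) or four (float32), at the cost of a rounding error of at most half a quantization step. In faster-whisper, quantized inference is selected through the `compute_type` argument of `WhisperModel`, for example `compute_type="int8"`.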

First, install the faster-whisper package.

pip install faster-whisper

Next, download the model converted to the CTranslate2 format.

python -c "from huggingface_hub import snapshot_download; snapshot_download(repo_id='bofenghuang/whisper-large-v3-french-distil-dec16', local_dir='./models/whisper-large-v3-french-distil-dec16', allow_patterns='ctranslate2/*')"

Now you can transcribe audio files by following the usage instructions in the repository.

from datasets import load_dataset
from faster_whisper import WhisperModel

# Load model
model = WhisperModel("./models/whisper-large-v3-french-distil-dec16/ctranslate2", device="cuda", compute_type="float16")  # Run on GPU with FP16

# Example audio
dataset = load_dataset("bofenghuang/asr-dummy", "fr", split="test")
sample = dataset[0]["audio"]["array"].astype("float32")

segments, info = model.transcribe(sample, beam_size=5, language="fr")

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
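For the subtitle use case, the timestamped segments printed above could be turned into SRT text along these lines. This is a minimal sketch: `srt_timestamp` and `to_srt` are hypothetical helpers, and they assume only that each segment exposes `start`, `end`, and `text` attributes as in the loop above.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT subtitle text from objects with start, end, and text."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(segment.start)} --> {srt_timestamp(segment.end)}\n"
            f"{segment.text.strip()}\n"
        )
    return "\n".join(blocks)
```

The result of `to_srt(segments)` can be written to a `.srt` file next to the video.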

Whisper.cpp

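Whisper.cpp is a reimplementation of OpenAI's Whisper models in plain C/C++ without dependencies. It is compatible with a variety of backends and platforms.

The model can additionally be quantized to 4-bit or 5-bit integers for further efficiency.

First, clone and build the whisper.cpp repository.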

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# build the main example
make

Next, download the converted ggml weights from the Hugging Face Hub.

# Download model quantized with Q5_0 method
python -c "from huggingface_hub import hf_hub_download; hf_hub_download(repo_id='bofenghuang/whisper-large-v3-french-distil-dec16', filename='ggml-model-q5_0.bin', local_dir='./models/whisper-large-v3-french-distil-dec16')"

Now you can transcribe an audio file with the following command.

./main -m ./models/whisper-large-v3-french-distil-dec16/ggml-model-q5_0.bin -l fr -f /path/to/audio/file --print-colors

Candle

Candle-whisper is a reimplementation of OpenAI's Whisper models in the candle format, a lightweight ML framework built in Rust.

First, clone the candle repository.

git clone https://github.com/huggingface/candle.git
cd candle/candle-examples/examples/whisper

Transcribe an audio file with the following command.

cargo run --example whisper --release -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french-distil-dec16 --language fr --input /path/to/audio/file

To use CUDA, add --features cuda to the example command line.

cargo run --example whisper --release --features cuda -- --model large-v3 --model-id bofenghuang/whisper-large-v3-french-distil-dec16 --language fr --input /path/to/audio/file

MLX

MLX-Whisper is a reimplementation of OpenAI's Whisper models in the MLX format, an ML framework for Apple silicon. It supports features such as lazy computation and unified memory management.

First, clone the MLX Examples repository.

git clone https://github.com/ml-explore/mlx-examples.git
cd mlx-examples/whisper

Next, install the dependencies.


📄 License

This model is provided under the MIT license.
