Granite-speech-3.3-2b open-source speech model: efficient speech recognition and translation

Granite Speech 3.3 2b

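Developed by ibm-granite

Granite-speech-3.3-2b is a compact, efficient speech-language model developed by IBM. It is designed specifically for automatic speech recognition (ASR) and automatic speech translation (AST), and it uses a two-pass design that improves modularity and safety.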

Speech recognition

Transformers

English | Open-source license: Apache-2.0 | #two-pass speech processing #multitask speech model #high-accuracy ASR/AST

Downloads: 4,363

Release time: 4/28/2025

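Model overview

This model focuses on speech-to-text transcription (ASR) and speech translation (AST). It uses a modular design: the first call transcribes the audio, and a second call processes the resulting text. It also supports multilingual tasks.

Model features

Two-pass design

Unlike single-pass integrated models, it first transcribes the audio on its own and then processes the text in a separate step, which improves modularity and safety.

Multitask support

A single model supports both speech recognition and speech translation, covering a range of application scenarios.

Efficient architecture

It combines a Conformer encoder, a q-former downsampler, and a Granite large language model to balance performance and efficiency.

LoRA adaptation

Rank-64 LoRA adapters are applied to the query and value projection matrices, which improves the model's flexibility.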
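
As a rough illustration of why a rank-64 adapter is small next to the matrix it modifies, the sketch below counts the parameters that LoRA adds to a single projection. The layer sizes (a 2048-dimensional hidden state and a 512-dimensional value projection) are hypothetical values chosen for illustration; they are not taken from this page.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dense_param_count(d_in: int, d_out: int) -> int:
    """Parameters in the dense projection matrix the adapter sits beside."""
    return d_in * d_out


# Hypothetical projection sizes, used only for illustration.
HIDDEN = 2048
QUERY_OUT = 2048
VALUE_OUT = 512
RANK = 64  # rank stated in the model description

q_added = lora_param_count(HIDDEN, QUERY_OUT, RANK)
v_added = lora_param_count(HIDDEN, VALUE_OUT, RANK)

print(f"query: +{q_added} adapter params vs {dense_param_count(HIDDEN, QUERY_OUT)} dense")
print(f"value: +{v_added} adapter params vs {dense_param_count(HIDDEN, VALUE_OUT)} dense")
```

Under these assumed sizes the adapter adds far fewer parameters than the dense matrix holds, which is the usual motivation for low-rank adaptation.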

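Model capabilities

Speech-to-text transcription

Cross-lingual speech translation

Long audio processing (128k context supported)

Use cases

Speech transcription

Automated meeting minutes

Convert meeting recordings into written records in real time

High-accuracy English transcription output

Real-time translation

Multilingual speech translation

Translate English speech into seven target languages in real time

Supports German, Spanish, French, Italian, Japanese, Portuguese, and Chinese output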

🚀 Granite-speech-3.3-2b

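Granite-speech-3.3-2b is a compact and efficient speech-language model designed specifically for automatic speech recognition (ASR) and automatic speech translation (AST). Unlike integrated models that combine speech and language in a single pass, Granite-speech-3.3-2b uses a two-pass design. The first call transcribes the audio file into text. To process the transcribed text with the underlying Granite language model, the user must make a second call; each step must be initiated explicitly.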
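
The two-pass flow can be pictured as two explicit function calls, as in the minimal sketch below. The names `run_two_pass`, `fake_transcribe`, and `fake_process_text` are hypothetical stand-ins so the example runs without a model; in practice each callable would wrap a real model call.

```python
from typing import Callable


def run_two_pass(
    audio: object,
    transcribe: Callable[[object], str],
    process_text: Callable[[str], str],
) -> dict:
    """Run the two explicit steps and return both results."""
    # Pass 1: speech -> text.
    transcript = transcribe(audio)
    # Pass 2: text -> text, started only after pass 1 has returned.
    reply = process_text(transcript)
    return {"transcript": transcript, "reply": reply}


# Stand-in callables so the sketch runs without loading a model.
def fake_transcribe(audio: object) -> str:
    return "hello world"


def fake_process_text(text: str) -> str:
    return text.upper()


result = run_two_pass(b"raw-audio-bytes", fake_transcribe, fake_process_text)
print(result)
```

Keeping the two steps as separate calls means the transcript can be inspected or filtered before any language-model processing happens.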

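The model was trained on a collection of public corpora comprising diverse datasets for ASR and AST, together with synthetic datasets created to support the speech translation task. Granite-speech-3.3-2b was trained by modality-aligning granite-3.3-2b-instruct (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) to speech on publicly available open-source corpora containing audio inputs and text targets.

We are currently investigating an issue with greedy decoding (num_beams=1). The model performs reliably with a beam size greater than 1, and we recommend that setting for all use cases. The model may also produce incorrect output for very short audio inputs (<0.1 s). These issues are under active investigation, and we will update this guidance when a fix is available.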
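
One way to act on this guidance is to check a request before it reaches the model. The sketch below is a hypothetical helper (the name `check_request` and its return format are assumptions, not part of any library) that flags the two conditions described above.

```python
MIN_AUDIO_SECONDS = 0.1  # threshold from the note above
MIN_NUM_BEAMS = 2        # "beam size greater than 1"


def check_request(num_samples: int, sample_rate: int, num_beams: int) -> list:
    """Return warnings for settings that the note above advises against."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")

    warnings = []
    duration = num_samples / sample_rate
    if duration < MIN_AUDIO_SECONDS:
        warnings.append(
            f"audio lasts {duration:.3f} s, shorter than {MIN_AUDIO_SECONDS} s"
        )
    if num_beams < MIN_NUM_BEAMS:
        warnings.append(f"num_beams={num_beams}; use a value greater than 1")
    return warnings


# 1000 samples at 16 kHz is 0.0625 s, and num_beams=1 is greedy decoding.
print(check_request(num_samples=1000, sample_rate=16000, num_beams=1))
# One second of audio with beam search raises no warnings.
print(check_request(num_samples=16000, sample_rate=16000, num_beams=4))
```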

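🚀 Quick start

Model summary

Granite-speech-3.3-2b is a compact and efficient speech-language model designed specifically for automatic speech recognition (ASR) and automatic speech translation (AST).

Evaluation

We evaluated granite-speech-3.3-2b alongside granite-speech-3.3-8b (https://huggingface.co/ibm-granite/granite-speech-3.3-8b), other speech-language models (SLMs) with fewer than 8 billion parameters, and dedicated ASR and AST systems on standard benchmarks. The evaluation spans multiple public benchmarks, with particular emphasis on English ASR, and also includes AST for En-X translation.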

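Release date

May 2, 2025

Supported languages

English

Intended use

The model is intended for enterprise applications that process speech input. It is particularly well suited to English speech-to-text and to speech translation from English into some major European languages (French, Spanish, Italian, German, and Portuguese) as well as Japanese and Chinese. It can also be used for tasks with text-only input, because a prompt that contains no audio is handled by the underlying granite-3.3-2b-instruct.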
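
The three kinds of request described above (transcription, translation, and text-only) differ only in the user message. The sketch below builds those messages, using the `<|audio|>` placeholder that appears in this page's usage examples. The translation instruction wording is a hypothetical example, not an official prompt, and the helper names are assumptions.

```python
AUDIO_TOKEN = "<|audio|>"  # placeholder used in this page's usage examples

# Target languages listed in this document.
TARGET_LANGUAGES = [
    "French", "Spanish", "Italian", "German", "Portuguese", "Japanese", "Chinese",
]


def build_user_message(instruction: str, has_audio: bool) -> dict:
    """Prefix the audio placeholder only when the request carries audio."""
    content = f"{AUDIO_TOKEN}{instruction}" if has_audio else instruction
    return {"role": "user", "content": content}


def translation_instruction(language: str) -> str:
    """Hypothetical instruction wording for a translation request."""
    if language not in TARGET_LANGUAGES:
        raise ValueError(f"unsupported target language: {language}")
    return f"translate the speech to {language}"


asr_msg = build_user_message(
    "can you transcribe the speech into a written format?", has_audio=True
)
ast_msg = build_user_message(translation_instruction("French"), has_audio=True)
text_msg = build_user_message("What is the capital of Brazil?", has_audio=False)

for msg in (asr_msg, ast_msg, text_msg):
    print(msg["content"])
```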

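✨ Main features

Designed specifically for automatic speech recognition (ASR) and automatic speech translation (AST)
Two-pass design that separates speech processing from language processing
Trained on diverse public corpora and synthetic datasets

📦 Installation

Using `transformers`

First, install the latest version of transformers from source.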

pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile

Using `vLLM`

First, install the latest version of vLLM.

pip install vllm --upgrade

💻 Usage examples

Using `transformers`

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-2b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template

audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: May 2, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)
 
model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")

Using `vLLM`

Offline mode code

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048, # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
    lora_request=[LoRARequest("speech", 1, model_id)]
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")


### 2. Example without Audio [do NOT use the lora]
question = "What is the capital of Brazil?"
prompt = get_prompt(
    question=question,
    has_audio=False,
)

outputs = model.generate(
    {"prompt": prompt},
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=12,
    ),
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

Online mode code

"""
Launch the vLLM server with the following command:

vllm serve ibm-granite/granite-speech-3.3-2b \
    --api-key token-abc123 \
    --max-model-len 2048 \
    --enable-lora  \
    --lora-modules speech=ibm-granite/granite-speech-3.3-2b \
    --max-lora-rank 64
"""

import base64

import requests
from openai import OpenAI

from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

base_model_name = "ibm-granite/granite-speech-3.3-2b"
lora_model_name = "speech"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url

# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio retrieved from a remote url to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')
    return result

audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)

### 1. Example with Audio
# NOTE: we pass the name of the lora model (`speech`) here because we have audio.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    model=lora_model_name,
)


print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")


### 2. Example without Audio
# NOTE: we pass the name of the base model here because we do not have audio.
question = "What is the capital of Brazil?"
chat_completion_text_only = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
        ],
    }],
    temperature=0.2,
    max_tokens=12,
    model=base_model_name,
)

print(f"Text Only Example - Question: {question}")
print(f"Generated text: {chat_completion_text_only.choices[0].message.content}")

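📚 Documentation

Model architecture

The architecture of Granite-speech-3.3-2b consists of the following components.

(1) Speech encoder: 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets. The encoder uses block attention over 4-second audio blocks and self-conditioned CTC from the middle layer.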

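Configuration parameter	Value
Input dimension	160 (80 logmels x 2)
Number of layers	10
Hidden dimension	1024
Number of attention heads	8
Attention head size	128
Convolution kernel size	15
Output dimension	42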

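(2) Speech projector and temporal downsampler (speech-text modality adapter): a 2-layer window query transformer (q-former) operates on blocks of 15 acoustic embeddings, each 1024-dimensional, taken from the last conformer block of the speech encoder. It uses 3 trainable queries per block and per layer, which downsamples by a factor of 5. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), giving the LLM acoustic embeddings at a rate of 10 Hz. The encoder, projector, and LoRA adapters were fine-tuned/trained jointly on all the corpora described under Training data.

(3) Large language model: granite-3.3-2b-instruct (https://huggingface.co/ibm-granite/granite-3.3-2b-instruct) with a 128k context length

(4) LoRA adapters: rank = 64, applied to the query and value projection matrices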
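
The numbers in components (1) and (2) can be cross-checked with a little arithmetic, as in the sketch below. The 100 Hz input frame rate is an assumption inferred from the stated figures (10 Hz at the LLM times a total downsampling factor of 10). The class and function names are hypothetical, and the embedding count ignores any padding of a final partial block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderConfig:
    """Values copied from the encoder configuration table."""
    logmel_bins: int = 80
    stacked_frames: int = 2
    hidden_dim: int = 1024
    attention_heads: int = 8
    attention_head_size: int = 128

    @property
    def input_dim(self) -> int:
        return self.logmel_bins * self.stacked_frames


FRAME_RATE_HZ = 100       # assumption: implied by 10 Hz x total factor of 10
ENCODER_DOWNSAMPLE = 2    # 2x from the encoder
WINDOW_SIZE = 15          # encoder embeddings per q-former block
QUERIES_PER_WINDOW = 3    # trainable queries per block


def llm_embedding_rate_hz() -> float:
    """Rate at which acoustic embeddings reach the LLM."""
    encoder_rate = FRAME_RATE_HZ / ENCODER_DOWNSAMPLE
    # Each block of 15 encoder embeddings is summarized by 3 query outputs.
    return encoder_rate * QUERIES_PER_WINDOW / WINDOW_SIZE


def embeddings_for(seconds: float) -> int:
    """Approximate number of acoustic embeddings for a clip of this length."""
    return round(seconds * llm_embedding_rate_hz())


cfg = EncoderConfig()
print(cfg.input_dim)                                   # 160
print(cfg.attention_heads * cfg.attention_head_size)   # 1024, equals hidden_dim
print(llm_embedding_rate_hz())                         # 10.0
print(embeddings_for(30.0))                            # 300
```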

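Training data

Overall, our training data comes from two main sources: (1) publicly available datasets, and (2) synthetic data created from publicly available datasets specifically for the speech translation task. The training datasets are described in detail in the table below.

Name	Task	Hours	Source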
CommonVoice-17 English	ASR	2600	https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
MLS English	ASR	44000	https://huggingface.co/datasets/facebook/multilingual_librispeech
Librispeech	ASR	1000	https://huggingface.co/datasets/openslr/librispeech_asr
VoxPopuli English	ASR	500	https://huggingface.co/datasets/facebook/voxpopuli
AMI	ASR	100	https://huggingface.co/datasets/edinburghcstr/ami
YODAS English	ASR	10000	https://huggingface.co/datasets/espnet/yodas
Switchboard English	ASR	260	https://catalog.ldc.upenn.edu/LDC97S62
CallHome English	ASR	18	https://catalog.ldc.upenn.edu/LDC97T14
Fisher	ASR	2000	https://catalog.ldc.upenn.edu/LDC2004S13
Voicemail part I	ASR	40	https://catalog.ldc.upenn.edu/LDC98S77
Voicemail part II	ASR	40	https://catalog.ldc.upenn.edu/LDC2002S35
CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh	AST	2600*7	Translations with Phi-4 and MADLAD
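
To get a feel for the balance of the training mix, the sketch below totals the hours in the table. It reads the AST entry `2600*7` as 2,600 hours for each of the 7 target languages, which is an interpretation of the table's notation rather than a stated fact.

```python
# Hours per ASR corpus, copied from the table above.
ASR_HOURS = {
    "CommonVoice-17 English": 2600,
    "MLS English": 44000,
    "Librispeech": 1000,
    "VoxPopuli English": 500,
    "AMI": 100,
    "YODAS English": 10000,
    "Switchboard English": 260,
    "CallHome English": 18,
    "Fisher": 2000,
    "Voicemail part I": 40,
    "Voicemail part II": 40,
}

# Assumption: "2600*7" means 2600 hours for each of 7 target languages.
AST_HOURS = 2600 * 7

asr_total = sum(ASR_HOURS.values())
grand_total = asr_total + AST_HOURS
largest = max(ASR_HOURS, key=ASR_HOURS.get)

print(f"ASR hours: {asr_total}")        # 60558
print(f"AST hours: {AST_HOURS}")        # 18200
print(f"Total hours: {grand_total}")    # 78758
print(f"Largest ASR corpus: {largest} ({ASR_HOURS[largest] / asr_total:.0%} of ASR hours)")
```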

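Infrastructure

We train Granite Speech on Blue Vela, IBM's supercomputing cluster. The cluster is equipped with NVIDIA H100 GPUs and provides a scalable, efficient infrastructure for training models across thousands of GPUs. Training this particular model took 9 days on 32 H100 GPUs.

Ethical considerations and limitations

Users should be aware that the model may produce unreliable output when decoding greedily (num_beams=1) or when processing very short audio clips (<0.1 s). Until further updates are released, we recommend a beam size greater than 1 and avoiding inputs shorter than 0.1 seconds for more consistent performance.

The use of large speech and language models may involve risks and ethical considerations that people should be aware of, including bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-2b in a manner consistent with IBM's Responsible Use Guide or similar responsible-use frameworks. IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio input can influence the system. When it receives an unfamiliar or malformed prompt, the model simply transcribes it and echoes it back. This minimizes the risk of adversarial input, unlike integrated models that interpret audio directly. Note, however, that more general speech tasks may carry a higher inherent risk of triggering undesired output.

To enhance safety, we recommend using granite-speech-3.3-2b together with Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across the key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes human-annotated data and synthetic data informed by internal red teaming, enables it to outperform similar open-source models on standard benchmarks and provides an additional layer of safety.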