🚀 Granite-speech-3.3-8b
Granite-speech-3.3-8b is a compact and efficient speech-language model designed for automatic speech recognition (ASR) and automatic speech translation (AST). It uses a two-pass design, unlike integrated models that fold speech and language processing into a single pass: the first call to the model transcribes an audio file into text; to process the transcribed text with the underlying Granite language model, the user makes a second call, since each step must be invoked explicitly.
🚀 Quick Start
Granite Speech models are natively supported on the main branch of the transformers library. Below is a simple example of how to use the granite-speech-3.3-8b model.
Using the transformers library
First, make sure to build the latest version of the transformers library from source:
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
Then run the following code:
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template
audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device,  # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
Using the vLLM library
First, make sure to install the latest version of the vLLM library:
pip install vllm --upgrade
Offline mode code
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
    lora_request=[LoRARequest("speech", 1, model_id)]
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

### 2. Example without Audio [do NOT use the lora]
question = "What is the capital of Brazil?"
prompt = get_prompt(
    question=question,
    has_audio=False,
)

outputs = model.generate(
    {"prompt": prompt},
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=12,
    ),
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")
Online mode code
"""
Launch the vLLM server with the following command:
vllm serve ibm-granite/granite-speech-3.3-8b \
--api-key token-abc123 \
--max-model-len 2048 \
--enable-lora \
--lora-modules speech=ibm-granite/granite-speech-3.3-8b \
--max-lora-rank 64
"""
import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
base_model_name = "ibm-granite/granite-speech-3.3-8b"
lora_model_name = "speech"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url
# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
"""Encode an audio retrieved from a remote url to base64 format."""
with requests.get(audio_url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode('utf-8')
return result
audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)
### 1. Example with Audio
# NOTE: we pass the name of the lora model (`speech`) here because we have audio.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
{
"type": "audio_url",
"audio_url": {
# Any format supported by librosa is supported
"url": f"data:audio/ogg;base64,{audio_base64}"
},
},
],
}],
temperature=0.2,
max_tokens=64,
model=lora_model_name,
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
### 2. Example without Audio
# NOTE: we pass the name of the base model here because we do not have audio.
question = "What is the capital of Brazil?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
],
}],
temperature=0.2,
max_tokens=12,
model=base_model_name,
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
✨ Key Features
- Compact and efficient: designed for automatic speech recognition (ASR) and automatic speech translation (AST), with a two-pass design that keeps processing efficient.
- Modality adaptation: a speech encoder, speech projector, and temporal downsampler adapt the speech modality to the text modality.
- Scalability: trained on IBM's supercomputing cluster, providing scalable training infrastructure.
📦 Installation
When using the transformers library, build the latest version from source:
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
When using the vLLM library, install the latest version:
pip install vllm --upgrade
📚 Documentation
Model Architecture
The architecture of Granite-speech-3.3-8b consists of the following components:
- Speech encoder: 10 Conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets.

| Configuration parameter | Value |
| ---- | ---- |
| Input dimension | 160 (80 logmels x 2) |
| Number of layers | 10 |
| Hidden dimension | 1024 |
| Number of attention heads | 8 |
| Attention head size | 128 |
| Convolution kernel size | 15 |
| Output dimension | 42 |

- Speech projector and temporal downsampler (speech-text modality adapter): a 2-layer window query transformer (q-former) that operates on blocks of 15 1024-dimensional acoustic embeddings from the last Conformer block of the speech encoder and downsamples them 5x using 3 trainable queries per block per layer (see the sketch after this list).
- Large language model: granite-3.3-8b-instruct with a 128k context length.
- LoRA adapter: rank 64, applied to the query and value projection matrices.
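As a rough illustration of the adapter's bookkeeping (the helper below is illustrative and not part of the released code): the encoder consumes stacked log-mel features (80 mels x 2 = 160 input dimensions), and the window q-former maps every block of 15 acoustic embeddings to 3 query outputs, i.e. a 5x temporal downsampling before the sequence reaches the LLM.

```python
import math

def projected_length(num_acoustic_embeddings: int,
                     window: int = 15,
                     queries_per_window: int = 3) -> int:
    """Illustrative only: length of the embedding sequence handed to the LLM
    after the window q-former downsamples the encoder output."""
    num_windows = math.ceil(num_acoustic_embeddings / window)
    return num_windows * queries_per_window

# Assumption: the 2x frame stacking (80 logmels x 2 = 160) halves the frame rate,
# so ~10 s of audio at a 10 ms hop (~1000 frames) yields ~500 acoustic embeddings.
print(projected_length(500))  # -> 102, roughly a 5x reduction
```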
Training Data
The training data comes primarily from two key sources: publicly available datasets and synthetic data created for the speech translation task.
Name | Task | Duration (hours) | Source |
---|---|---|---|
CommonVoice-17 English | ASR | 2600 | https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 |
MLS English | ASR | 44000 | https://huggingface.co/datasets/facebook/multilingual_librispeech |
Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
Switchboard English | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
CallHome English | ASR | 18 | https://catalog.ldc.upenn.edu/LDC97T14 |
Fisher | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
Voicemail part I | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
Voicemail part II | ASR | 40 | https://catalog.ldc.upenn.edu/LDC2002S35 |
CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | Translations with Phi-4 and MADLAD |
Infrastructure
Training was carried out on Blue Vela, IBM's supercomputing cluster equipped with NVIDIA H100 GPUs, which provides scalable and efficient training infrastructure across thousands of GPUs. This particular model was trained in 9 days on 32 H100 GPUs.
🔧 Technical Details
Evaluation
On standard benchmarks, Granite-speech-3.3-8b was evaluated against other speech-language models (SLMs) with fewer than 8 billion parameters as well as dedicated ASR and AST systems. The evaluation spans several public benchmarks, with particular emphasis on English ASR tasks while also covering English-to-X (En-X) translation.
📄 License
This model is released under the Apache 2.0 license.
⚠️ Important Notes
- The model can produce unreliable output when using greedy decoding (num_beams=1) or when processing extremely short audio clips (<0.1s). Until further updates are released, use a beam size greater than 1 and avoid audio inputs shorter than 0.1 seconds for more consistent performance; a sketch of these safeguards follows this list.
- The use of large speech and language models can involve risks and ethical considerations, including bias and fairness, misinformation, and autonomous decision-making. The community is encouraged to use Granite-speech-3.3-8b in accordance with IBM's responsible use guide or a similar responsible use framework.
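A minimal sketch of those two safeguards, assuming the objects from the transformers Quick Start above; the 0.1 s threshold and the beam size simply restate the guidance in the first bullet.

```python
MIN_DURATION_S = 0.1  # avoid extremely short clips, per the note above

duration_s = wav.shape[-1] / sr
if duration_s < MIN_DURATION_S:
    raise ValueError(f"Audio is {duration_s:.3f}s long; expected at least {MIN_DURATION_S}s")

# Prefer beam search (num_beams > 1) over greedy decoding (num_beams=1).
model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
)
```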
💡 Usage Tips
For enhanced safety, we recommend using Granite-speech-3.3-8b together with Granite Guardian, a fine-tuned instruction model designed to detect and flag risks in prompts and responses.
📦 Resources
- 📄 Read the full technical report: https://arxiv.org/abs/2505.08699
- ⭐️ Learn about the latest Granite updates: https://www.ibm.com/granite
- 🚀 Find tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Explore the latest Granite learning resources: https://ibm.biz/granite-learning-resources