🚀 Granite-speech-3.3-8b
Granite-speech-3.3-8b is a compact and efficient speech-language model designed for automatic speech recognition (ASR) and automatic speech translation (AST). It uses a two-pass design, unlike integrated models that fold speech and language processing into a single pass: the first call to the model transcribes an audio file into text; to process the transcribed text with the underlying Granite language model, the user makes a second call, since each step must be invoked explicitly.
🚀 Quick Start
Granite Speech models are natively supported on the main branch of the transformers library. Below is a simple example of how to use the granite-speech-3.3-8b model.
Using the transformers library
First, make sure to build the latest version of the transformers library from source:
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
Then run the following code:
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template
audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000  # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device,  # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)

model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

output_text = tokenizer.batch_decode(
    new_tokens, add_special_tokens=False, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
Using the vLLM library
First, make sure to install the latest version of the vLLM library:
pip install vllm --upgrade
Offline mode code
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048,  # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
    lora_request=[LoRARequest("speech", 1, model_id)]
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

### 2. Example without Audio [do NOT use the lora]
question = "What is the capital of Brazil?"
prompt = get_prompt(
    question=question,
    has_audio=False,
)

outputs = model.generate(
    {"prompt": prompt},
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=12,
    ),
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")
Online mode code
"""
Launch the vLLM server with the following command:
vllm serve ibm-granite/granite-speech-3.3-8b \
--api-key token-abc123 \
--max-model-len 2048 \
--enable-lora \
--lora-modules speech=ibm-granite/granite-speech-3.3-8b \
--max-lora-rank 64
"""
import base64
import requests
from openai import OpenAI
from vllm.assets.audio import AudioAsset
# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
# defaults to os.environ.get("OPENAI_API_KEY")
api_key=openai_api_key,
base_url=openai_api_base,
)
base_model_name = "ibm-granite/granite-speech-3.3-8b"
lora_model_name = "speech"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url
# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
"""Encode an audio retrieved from a remote url to base64 format."""
with requests.get(audio_url) as response:
response.raise_for_status()
result = base64.b64encode(response.content).decode('utf-8')
return result
audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)
### 1. Example with Audio
# NOTE: we pass the name of the lora model (`speech`) here because we have audio.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
{
"type": "audio_url",
"audio_url": {
# Any format supported by librosa is supported
"url": f"data:audio/ogg;base64,{audio_base64}"
},
},
],
}],
temperature=0.2,
max_tokens=64,
model=lora_model_name,
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
### 2. Example without Audio
# NOTE: we pass the name of the base model here because we do not have audio.
question = "What is the capital of Brazil?"
chat_completion_with_audio = client.chat.completions.create(
messages=[{
"role": "user",
"content": [
{
"type": "text",
"text": question
},
],
}],
temperature=0.2,
max_tokens=12,
model=base_model_name,
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")
✨ Key Features
- Compact and efficient: designed for automatic speech recognition (ASR) and automatic speech translation (AST), with a two-pass design that keeps processing efficient.
- Modality adaptation: a speech encoder, speech projector, and temporal downsampler adapt the speech modality to the text modality.
- Scalability: trained on IBM's supercomputing cluster, providing scalable training infrastructure.
📦 Installation
When using the transformers library, build the latest version from source:
pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile
When using the vLLM library, install the latest version:
pip install vllm --upgrade
📚 Documentation
Model Architecture
The architecture of Granite-speech-3.3-8b consists of the following components:
- Speech encoder: 10 Conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets.

| Configuration parameter | Value |
| ---- | ---- |
| Input dimension | 160 (80 logmels x 2) |
| Number of layers | 10 |
| Hidden dimension | 1024 |
| Number of attention heads | 8 |
| Attention head size | 128 |
| Convolution kernel size | 15 |
| Output dimension | 42 |

- Speech projector and temporal downsampler (speech-text modality adapter): a 2-layer window query transformer (q-former) that operates on blocks of 15 1024-dimensional acoustic embeddings from the last Conformer block of the speech encoder and downsamples them 5x using 3 trainable queries per block per layer (see the sketch after this list).
- Large language model: granite-3.3-8b-instruct with a 128k context length.
- LoRA adapter: rank 64, applied to the query and value projection matrices.
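As a rough illustration of the adapter's bookkeeping (the helper below is illustrative and not part of the released code): the encoder consumes stacked log-mel features (80 mels x 2 = 160 input dimensions), and the window q-former maps every block of 15 acoustic embeddings to 3 query outputs, i.e. a 5x temporal downsampling before the sequence reaches the LLM.

```python
import math

def projected_length(num_acoustic_embeddings: int,
                     window: int = 15,
                     queries_per_window: int = 3) -> int:
    """Illustrative only: length of the embedding sequence handed to the LLM
    after the window q-former downsamples the encoder output."""
    num_windows = math.ceil(num_acoustic_embeddings / window)
    return num_windows * queries_per_window

# Assumption: the 2x frame stacking (80 logmels x 2 = 160) halves the frame rate,
# so ~10 s of audio at a 10 ms hop (~1000 frames) yields ~500 acoustic embeddings.
print(projected_length(500))  # -> 102, roughly a 5x reduction
```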
Training Data
The training data comes primarily from two key sources: publicly available datasets and synthetic data created for the speech translation task.
Name | Task | Duration (hours) | Source |
---|---|---|---|
CommonVoice-17 English | ASR | 2600 | https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0 |
MLS English | ASR | 44000 | https://huggingface.co/datasets/facebook/multilingual_librispeech |
Librispeech | ASR | 1000 | https://huggingface.co/datasets/openslr/librispeech_asr |
VoxPopuli English | ASR | 500 | https://huggingface.co/datasets/facebook/voxpopuli |
AMI | ASR | 100 | https://huggingface.co/datasets/edinburghcstr/ami |
YODAS English | ASR | 10000 | https://huggingface.co/datasets/espnet/yodas |
Switchboard English | ASR | 260 | https://catalog.ldc.upenn.edu/LDC97S62 |
CallHome English | ASR | 18 | https://catalog.ldc.upenn.edu/LDC97T14 |
Fisher | ASR | 2000 | https://catalog.ldc.upenn.edu/LDC2004S13 |
Voicemail part I | ASR | 40 | https://catalog.ldc.upenn.edu/LDC98S77 |
Voicemail part II | ASR | 40 | https://catalog.ldc.upenn.edu/LDC2002S35 |
CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh | AST | 2600*7 | Translations with Phi-4 and MADLAD |
Infrastructure
Training was carried out on Blue Vela, IBM's supercomputing cluster equipped with NVIDIA H100 GPUs, which provides scalable and efficient training infrastructure across thousands of GPUs. This particular model was trained in 9 days on 32 H100 GPUs.
🔧 Technical Details
Evaluation
On standard benchmarks, Granite-speech-3.3-8b was evaluated against other speech-language models (SLMs) with fewer than 8 billion parameters as well as dedicated ASR and AST systems. The evaluation spans several public benchmarks, with particular emphasis on English ASR tasks while also covering English-to-X (En-X) translation.
📄 License
This model is released under the Apache 2.0 license.
⚠️ Important Notes
- The model can produce unreliable output when using greedy decoding (num_beams=1) or when processing extremely short audio clips (<0.1s). Until further updates are released, use a beam size greater than 1 and avoid audio inputs shorter than 0.1 seconds for more consistent performance; a sketch of these safeguards follows this list.
- The use of large speech and language models can involve risks and ethical considerations, including bias and fairness, misinformation, and autonomous decision-making. The community is encouraged to use Granite-speech-3.3-8b in accordance with IBM's responsible use guide or a similar responsible use framework.
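A minimal sketch of those two safeguards, assuming the objects from the transformers Quick Start above; the 0.1 s threshold and the beam size simply restate the guidance in the first bullet.

```python
MIN_DURATION_S = 0.1  # avoid extremely short clips, per the note above

duration_s = wav.shape[-1] / sr
if duration_s < MIN_DURATION_S:
    raise ValueError(f"Audio is {duration_s:.3f}s long; expected at least {MIN_DURATION_S}s")

# Prefer beam search (num_beams > 1) over greedy decoding (num_beams=1).
model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
)
```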
💡 Usage Tips
For enhanced safety, we recommend using Granite-speech-3.3-8b together with Granite Guardian, a fine-tuned instruction model designed to detect and flag risks in prompts and responses.
📦 Resources
- 📄 Read the full technical report: https://arxiv.org/abs/2505.08699
- ⭐️ Learn about the latest Granite updates: https://www.ibm.com/granite
- 🚀 Find tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
- 💡 Explore the latest Granite learning resources: https://ibm.biz/granite-learning-resources