Granite-speech-3.3-8b: An Open-Source Speech Model for Efficient Automatic Speech Recognition and Translation

Granite Speech 3.3 8b

Developed by ibm-granite

A compact and efficient speech-language model designed for Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST), featuring a two-stage design for processing audio and text

Automatic Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#Two-stage speech processing #Enterprise-grade speech translation #Low-parameter efficiency

Downloads 5,532

Release Time : 4/14/2025

Model Overview

A speech-language model adapted from Granite-3.3-8b-instruct, excelling in English speech-to-text and English-to-multilingual speech translation, trained with modality alignment techniques

Model Features

Two-stage processing design

First transcribes audio into text, then processes the text through the underlying language model, reducing the risk of modality interference

Multi-task support

Simultaneously supports both speech recognition (ASR) and speech translation (AST) tasks

Efficient architecture

10-layer Conformer encoder combined with a 2-layer Transformer downsampler achieves 10x temporal compression

Enterprise-grade optimization

Optimized for enterprise speech processing scenarios, with particular strength in English and major European languages

Model Capabilities

English speech-to-text

English-to-multilingual speech translation

Plain text task processing

Long audio processing (supports 128k context)

Use Cases

Speech transcription

Meeting minutes automation

Real-time transcription of English meeting recordings into text records

Achieves SOTA performance on the CommonVoice-17 test set

Cross-language communication

Real-time speech translation

Real-time conversion of English to French/Spanish and other languages

Outperforms similar 8B-parameter models on the IWSLT test set

🚀 Granite-speech-3.3-8b

Granite-speech-3.3-8b is a compact and efficient speech-language model designed for automatic speech recognition (ASR) and automatic speech translation (AST).

🚀 Quick Start

Granite-speech-3.3-8b uses a two-pass design. The first call transcribes an audio file into text. To process the transcribed text with the underlying Granite language model, users must make a second call.
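
The two-pass flow can be sketched as a small helper. This is only an illustration under stated assumptions, not IBM's implementation: `transcribe` and `respond` are hypothetical callables standing in for the first (audio) call and the second (text-only) call, and `two_pass` is a name made up for this sketch.

```python
from typing import Callable, Tuple


def two_pass(
    audio_path: str,
    transcribe: Callable[[str], str],
    respond: Callable[[str], str],
    instruction: str,
) -> Tuple[str, str]:
    """Run both passes in order and return (transcript, answer).

    `transcribe` and `respond` are hypothetical stand-ins: the first wraps a
    call that sends audio to the model, the second wraps a text-only call to
    the underlying Granite language model.
    """
    # Pass 1: audio -> text.
    transcript = transcribe(audio_path).strip()
    # Pass 2: the transcript is ordinary text, so it goes into a normal prompt.
    prompt = f"{instruction}\n\n{transcript}"
    answer = respond(prompt)
    return transcript, answer
```

In practice the two callables would wrap the `transformers` or `vLLM` calls shown in the usage examples, with audio passed only in the first one.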

✨ Features

Compact and Efficient: Specifically designed for ASR and AST.
Two-Pass Design: Transcribes audio first and then processes the text separately.
Multilingual Support: Suitable for English speech-to-text and speech translations from English to major European languages, Japanese, and Mandarin.

📦 Installation

Usage with `transformers`

First, make sure to build the latest version of transformers from source:

pip install https://github.com/huggingface/transformers/archive/main.zip torchaudio peft soundfile

Usage with `vLLM`

First, make sure to install the latest version of vLLM:

pip install vllm --upgrade

💻 Usage Examples

Basic Usage

Usage with `transformers`

import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq
from huggingface_hub import hf_hub_download

device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "ibm-granite/granite-speech-3.3-8b"
speech_granite_processor = AutoProcessor.from_pretrained(
    model_name)
tokenizer = speech_granite_processor.tokenizer
speech_granite = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name).to(device)

# prepare speech and text prompt, using the appropriate prompt template

audio_path = hf_hub_download(repo_id=model_name, filename='10226_10111_000000.wav')
wav, sr = torchaudio.load(audio_path, normalize=True)
assert wav.shape[0] == 1 and sr == 16000 # mono, 16khz

# create text prompt
chat = [
    {
        "role": "system",
        "content": "Knowledge Cutoff Date: April 2024.\nToday's Date: April 9, 2025.\nYou are Granite, developed by IBM. You are a helpful AI assistant",
    },
    {
        "role": "user",
        "content": "<|audio|>can you transcribe the speech into a written format?",
    }
]

text = tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)

# compute audio embeddings
model_inputs = speech_granite_processor(
    text,
    wav,
    device=device, # Computation device; returned tensors are put on CPU
    return_tensors="pt",
).to(device)
 
model_outputs = speech_granite.generate(
    **model_inputs,
    max_new_tokens=200,
    num_beams=4,
    do_sample=False,
    min_length=1,
    top_p=1.0,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=1.0,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.pad_token_id,
)

# Transformers includes the input IDs in the response.
num_input_tokens = model_inputs["input_ids"].shape[-1]
new_tokens = torch.unsqueeze(model_outputs[0, num_input_tokens:], dim=0)

# Decode only the newly generated tokens, dropping special tokens.
output_text = tokenizer.batch_decode(
    new_tokens, skip_special_tokens=True
)
print(f"STT output = {output_text[0].upper()}")
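
The example above asserts that the input is mono and sampled at 16 kHz. A quick preflight check on a file can be sketched with the standard-library `wave` module. This is an assumption-laden sketch: `WavInfo` and `inspect_wav` are hypothetical names, and `wave` only reads uncompressed PCM WAV files, so other formats would need a different loader.

```python
import wave
from dataclasses import dataclass


@dataclass
class WavInfo:
    channels: int
    sample_rate: int
    duration_s: float

    @property
    def matches_example(self) -> bool:
        # Same condition as the assert in the example: mono, 16 kHz.
        return self.channels == 1 and self.sample_rate == 16000


def inspect_wav(path: str) -> WavInfo:
    """Read channel count, sample rate and duration from a PCM WAV header."""
    with wave.open(path, "rb") as f:
        rate = f.getframerate()
        return WavInfo(f.getnchannels(), rate, f.getnframes() / rate)
```

If `matches_example` is false, the audio would need to be downmixed or resampled before it is passed to the processor.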

Advanced Usage

Usage with `vLLM`

Code for offline mode

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset
from vllm.lora.request import LoRARequest

model_id = "ibm-granite/granite-speech-3.3-8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

def get_prompt(question: str, has_audio: bool):
    """Build the input prompt to send to vLLM."""
    if has_audio:
        question = f"<|audio|>{question}"
    chat = [
        {
            "role": "user",
            "content": question
        }
    ]
    return tokenizer.apply_chat_template(chat, tokenize=False)

# NOTE - you may see warnings about multimodal lora layers being ignored;
# this is okay as the lora in this model is only applied to the LLM.
model = LLM(
    model=model_id,
    enable_lora=True,
    max_lora_rank=64,
    max_model_len=2048, # This may be needed for lower resource devices.
    limit_mm_per_prompt={"audio": 1},
)

### 1. Example with Audio [make sure to use the lora]
question = "can you transcribe the speech into a written format?"
prompt_with_audio = get_prompt(
    question=question,
    has_audio=True,
)
audio = AudioAsset("mary_had_lamb").audio_and_sample_rate

inputs = {
    "prompt": prompt_with_audio,
    "multi_modal_data": {
        "audio": audio,
    }
}

outputs = model.generate(
    inputs,
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=64,
    ),
    lora_request=[LoRARequest("speech", 1, model_id)]
)
print(f"Audio Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")


### 2. Example without Audio [do NOT use the lora]
question = "What is the capital of Brazil?"
prompt = get_prompt(
    question=question,
    has_audio=False,
)

outputs = model.generate(
    {"prompt": prompt},
    sampling_params=SamplingParams(
        temperature=0.2,
        max_tokens=12,
    ),
)
print(f"Text Only Example - Question: {question}")
print(f"Generated text: {outputs[0].outputs[0].text}")

Code for online mode

"""
Launch the vLLM server with the following command:

vllm serve ibm-granite/granite-speech-3.3-8b \
    --api-key token-abc123 \
    --max-model-len 2048 \
    --enable-lora  \
    --lora-modules speech=ibm-granite/granite-speech-3.3-8b \
    --max-lora-rank 64
"""

import base64

import requests
from openai import OpenAI

from vllm.assets.audio import AudioAsset

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "token-abc123"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

base_model_name = "ibm-granite/granite-speech-3.3-8b"
lora_model_name = "speech"
# Any format supported by librosa is supported
audio_url = AudioAsset("mary_had_lamb").url

# Use base64 encoded audio in the payload
def encode_audio_base64_from_url(audio_url: str) -> str:
    """Encode an audio retrieved from a remote url to base64 format."""
    with requests.get(audio_url) as response:
        response.raise_for_status()
        result = base64.b64encode(response.content).decode('utf-8')
    return result

audio_base64 = encode_audio_base64_from_url(audio_url=audio_url)

### 1. Example with Audio
# NOTE: we pass the name of the lora model (`speech`) here because we have audio.
question = "can you transcribe the speech into a written format?"
chat_completion_with_audio = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
            {
                "type": "audio_url",
                "audio_url": {
                    # Any format supported by librosa is supported
                    "url": f"data:audio/ogg;base64,{audio_base64}"
                },
            },
        ],
    }],
    temperature=0.2,
    max_tokens=64,
    model=lora_model_name,
)


print(f"Audio Example - Question: {question}")
print(f"Generated text: {chat_completion_with_audio.choices[0].message.content}")


### 2. Example without Audio
# NOTE: we pass the name of the base model here because we do not have audio.
question = "What is the capital of Brazil?"
chat_completion_text_only = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": question
            },
        ],
    }],
    temperature=0.2,
    max_tokens=12,
    model=base_model_name,
)

print(f"Text Only Example - Question: {question}")
print(f"Generated text: {chat_completion_text_only.choices[0].message.content}")
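
The two online examples differ only in whether audio is attached and, as a consequence, in which model name is sent. A minimal sketch of a helper that captures this rule is shown below. `build_request` is a hypothetical name, the `"speech"` LoRA name matches the `--lora-modules` flag in the serve command above, and the `audio/ogg` default mirrors the example payload.

```python
import base64
from typing import Optional

BASE_MODEL = "ibm-granite/granite-speech-3.3-8b"
LORA_MODEL = "speech"  # name registered via --lora-modules in the serve command


def build_request(
    question: str,
    audio_bytes: Optional[bytes] = None,
    mime: str = "audio/ogg",
) -> dict:
    """Build the `model` and `messages` arguments for a chat completion.

    With audio, the LoRA model name is used; without audio, the base model.
    """
    content = [{"type": "text", "text": question}]
    if audio_bytes is not None:
        b64 = base64.b64encode(audio_bytes).decode("utf-8")
        content.append(
            {"type": "audio_url", "audio_url": {"url": f"data:{mime};base64,{b64}"}}
        )
    return {
        "model": LORA_MODEL if audio_bytes is not None else BASE_MODEL,
        "messages": [{"role": "user", "content": content}],
    }
```

The result can be unpacked into the client call, for example `client.chat.completions.create(**build_request(question, audio_bytes), temperature=0.2, max_tokens=64)`.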

📚 Documentation

Model Architecture

The architecture of granite-speech-3.3-8b consists of the following components:

(1) Speech encoder: 10 conformer blocks trained with Connectionist Temporal Classification (CTC) on character-level targets, using only the subset of the training data that consists of ASR corpora (see configuration below). In addition, our CTC encoder uses block attention with 4-second audio blocks and self-conditioned CTC from the middle layer.

Property	Details
Input dimension	160 (80 logmels x 2)
Nb. of layers	10
Hidden dimension	1024
Nb. of attention heads	8
Attention head size	128
Convolution kernel size	15
Output dimension	42

(2) Speech projector and temporal downsampler (speech-text modality adapter): we use a 2-layer window query transformer (q-former) that operates on blocks of 15 1024-dimensional acoustic embeddings from the last conformer block of the speech encoder. Each block is downsampled by a factor of 5 using 3 trainable queries per block and per layer. The total temporal downsampling factor is 10 (2x from the encoder and 5x from the projector), which gives the LLM an acoustic embedding rate of 10 Hz. The encoder, projector and LoRA adapters were fine-tuned/trained jointly on all the corpora listed under Training Data.
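
The downsampling arithmetic can be checked with a short calculation. This sketch assumes a 100 Hz log-mel frame rate (a 10 ms hop), which is not stated above but is implied by a 10x reduction ending at 10 Hz. It also assumes the final partial block is padded to a full block. `llm_audio_positions` is a hypothetical helper name.

```python
import math

FRAME_RATE_HZ = 100     # assumption: 10 ms log-mel hop, implied by 10x -> 10 Hz
ENCODER_STACK = 2       # 80 log-mels x 2 -> 160-dim encoder input (2x)
BLOCK_SIZE = 15         # encoder frames per q-former block
QUERIES_PER_BLOCK = 3   # 15 frames -> 3 queries, i.e. 5x


def llm_audio_positions(duration_s: float) -> int:
    """Estimate how many acoustic embeddings the LLM receives for a clip."""
    # Round first to avoid float noise such as 0.1 * 100 == 10.000000000000002.
    frames = math.ceil(round(duration_s * FRAME_RATE_HZ, 6))
    encoder_frames = math.ceil(frames / ENCODER_STACK)
    # Assumption: the last partial block is padded to a full block.
    blocks = math.ceil(encoder_frames / BLOCK_SIZE)
    return blocks * QUERIES_PER_BLOCK
```

For a 60-second clip this gives 6000 frames, 3000 encoder frames, 200 blocks and 600 embeddings, which matches the stated 10 Hz rate.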

(3) Large language model: granite-3.3-8b-instruct with 128k context length (https://huggingface.co/ibm-granite/granite-3.3-8b-instruct).

(4) LoRA adapters: rank=64, applied to the query and value projection matrices
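
To get a feel for the adapter size, the number of parameters added by LoRA can be estimated: a rank-r adapter on a weight of shape (d_out, d_in) adds r * (d_in + d_out) parameters. The dimensions below (hidden size 4096, 40 layers, value projection width 1024) are assumptions about granite-3.3-8b-instruct that should be checked against the model's `config.json`; only the rank of 64 and the choice of query and value projections come from the description above.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(
    hidden: int = 4096,   # assumption: LLM hidden size
    v_out: int = 1024,    # assumption: value projection width (grouped-query attention)
    layers: int = 40,     # assumption: number of transformer layers
    rank: int = 64,       # from the model description
) -> int:
    """Sum the query and value adapters over all layers."""
    per_layer = lora_params(hidden, hidden, rank) + lora_params(hidden, v_out, rank)
    return per_layer * layers
```

Under these assumed dimensions the adapters add roughly 34 million parameters, a small fraction of the 8B-parameter base model.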

Training Data

Overall, our training data comes largely from two key sources: (1) publicly available datasets and (2) synthetic data created from publicly available datasets, specifically targeting the speech translation task. A detailed description of the training datasets can be found in the table below:

Name	Task	Nb. hours	Source
CommonVoice-17 English	ASR	2600	https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0
MLS English	ASR	44000	https://huggingface.co/datasets/facebook/multilingual_librispeech
Librispeech	ASR	1000	https://huggingface.co/datasets/openslr/librispeech_asr
VoxPopuli English	ASR	500	https://huggingface.co/datasets/facebook/voxpopuli
AMI	ASR	100	https://huggingface.co/datasets/edinburghcstr/ami
YODAS English	ASR	10000	https://huggingface.co/datasets/espnet/yodas
Switchboard English	ASR	260	https://catalog.ldc.upenn.edu/LDC97S62
CallHome English	ASR	18	https://catalog.ldc.upenn.edu/LDC97T14
Fisher	ASR	2000	https://catalog.ldc.upenn.edu/LDC2004S13
Voicemail part I	ASR	40	https://catalog.ldc.upenn.edu/LDC98S77
Voicemail part II	ASR	40	https://catalog.ldc.upenn.edu/LDC2002S35
CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh	AST	2600*7	Translations with Phi-4 and MADLAD
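
The per-task totals implied by the table can be computed directly. The figures are the nominal hours listed above, with the AST row taken as 2600 hours for each of the 7 target languages. The variable names are chosen for this sketch only.

```python
# (task, hours) per dataset, copied from the table above.
DATASETS = {
    "CommonVoice-17 English": ("ASR", 2600),
    "MLS English": ("ASR", 44000),
    "Librispeech": ("ASR", 1000),
    "VoxPopuli English": ("ASR", 500),
    "AMI": ("ASR", 100),
    "YODAS English": ("ASR", 10000),
    "Switchboard English": ("ASR", 260),
    "CallHome English": ("ASR", 18),
    "Fisher": ("ASR", 2000),
    "Voicemail part I": ("ASR", 40),
    "Voicemail part II": ("ASR", 40),
    "CommonVoice-17 En->De,Es,Fr,It,Ja,Pt,Zh": ("AST", 2600 * 7),
}

# Sum the hours per task.
totals = {}
for task, hours in DATASETS.values():
    totals[task] = totals.get(task, 0) + hours

print(totals)  # {'ASR': 60558, 'AST': 18200}
```

By these figures, MLS English alone accounts for well over half of the ASR hours.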

Infrastructure

We train Granite Speech using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs. The training of this particular model was completed in 9 days on 32 H100 GPUs.

Ethical Considerations and Limitations

⚠️ Important Note

Users should be aware that the model may produce unreliable outputs when decoding with num_beams=1 or when processing extremely short audio clips (<0.1s). Until further updates are released, we recommend using beam sizes greater than 1 and avoiding inputs below the 0.1-second threshold to ensure more consistent performance.
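
These two recommendations can be enforced with a small guard before calling `generate`. This is a minimal sketch: `check_request` is a hypothetical helper, and the thresholds are the ones stated in the note above (beam size greater than 1, audio no shorter than 0.1 s).

```python
from typing import List

MIN_DURATION_S = 0.1  # threshold from the note above


def check_request(num_samples: int, sample_rate: int, num_beams: int) -> List[str]:
    """Return a list of warnings; an empty list means the settings look safe."""
    problems = []
    if num_beams <= 1:
        problems.append("num_beams should be greater than 1 (greedy decoding is unreliable)")
    if num_samples / sample_rate < MIN_DURATION_S:
        problems.append(f"audio is shorter than {MIN_DURATION_S} s")
    return problems
```

With the `transformers` example, `num_samples` would be `wav.shape[-1]` and `sample_rate` would be `sr`.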

💡 Usage Tip

The use of Large Speech and Language Models may involve risks and ethical considerations that people should be aware of. These risks may include bias and fairness, misinformation, and autonomous decision-making. We urge the community to use granite-speech-3.3-8b in a manner consistent with IBM's Responsible Use Guide or similar responsible use structures.

IBM recommends using this model for automatic speech recognition tasks. The model's modular design improves safety by limiting how audio inputs can influence the system. If an unfamiliar or malformed prompt is received, the model simply echoes it with its transcription. This minimizes the risk of adversarial inputs, unlike integrated models that directly interpret audio and may be more exposed to such attacks. Note that more general speech tasks may pose higher inherent risks of triggering unwanted outputs.

To enhance safety, we recommend using granite-speech-3.3-8b alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas. Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

🔧 Technical Details

We are currently investigating an issue with greedy decoding (num_beams=1); the model performs reliably with beam sizes > 1, which we recommend for all use cases. Additionally, the model may occasionally hallucinate on very short audio inputs (<0.1s). These issues are under active investigation, and we will update guidance as fixes become available.

📄 License

This project is licensed under the Apache 2.0 license.

Resources

📄 Read the full technical report: https://arxiv.org/abs/2505.08699
⭐️ Learn about the latest updates with Granite: https://www.ibm.com/granite
🚀 Get started with tutorials, best practices, and prompt engineering advice: https://www.ibm.com/granite/docs/
💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
