Voxtral Mini Open-Source Audio AI Model - Free Deployment for Speech Transcription, Translation, and Understanding

Voxtral Mini 3B 2507 Transformers

Developed by MohamedRashad

Voxtral Mini builds on Ministral 3B, adding audio input while keeping its text capabilities. It performs strongly in speech transcription, translation, and audio understanding.

Audio-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Audio understanding #Multilingual transcription #Long context processing

Release Time: 7/18/2025

Model Overview

Voxtral Mini is a multimodal model that combines text and audio processing capabilities. It retains the text processing capabilities of Ministral 3B while adding powerful audio understanding functions.

Model Features

Dedicated transcription mode

Runs in a pure speech transcription mode that, by default, detects the source audio language automatically and transcribes it.

Long context processing

Supports a context length of 32k tokens, enough for audio up to 30 minutes long for transcription or 40 minutes long for understanding.

Built-in Q&A and summarization functions

Supports directly asking questions through audio and generating structured summaries without the need for separate ASR and language models.

Native multilingual support

Automatically detects the spoken language and supports 8 major languages: English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.

Function calling from voice

Can directly trigger backend functions, workflows, or API calls based on voice intent.
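As a rough illustration of what triggering a backend function from voice intent involves on the application side, the sketch below routes a tool call to a Python function. It assumes the model's reply has already been parsed into a JSON object with `name` and `arguments` fields; that shape, the `TOOLS` registry, and the two example functions are hypothetical and are not taken from the Voxtral documentation.

```python
import json
from typing import Any, Callable, Dict


# Hypothetical backend functions; names and signatures are illustrative only.
def get_weather(city: str) -> str:
    return f"Weather lookup requested for {city}"


def set_timer(minutes: int) -> str:
    return f"Timer set for {minutes} minutes"


# Registry mapping tool names to the callables that implement them.
TOOLS: Dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
    "set_timer": set_timer,
}


def dispatch(tool_call_json: str) -> Any:
    """Parse a JSON tool call and invoke the matching registered function."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"Unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


# Assumed tool call produced after the user says "set a timer for ten minutes".
example_call = '{"name": "set_timer", "arguments": {"minutes": 10}}'
print(dispatch(example_call))
```

Rejecting unknown tool names keeps the model's output from invoking anything outside the registry.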

Model Capabilities

Speech transcription

Audio understanding

Multilingual support

Long audio processing

Text generation

Q&A system

Summary generation

Multi-turn dialogue

Use Cases

Voice processing

Meeting record transcription

Automatically transcribes a 30-minute meeting recording into text.

Produces high-accuracy transcripts

Multilingual voice translation

Translates speech in one language into text in another language in real time.

Supports mutual translation among 8 major languages

Audio analysis

Audio content understanding

Ask questions about the audio content directly and get answers.

Understands audio content without prior transcription

Audio summary generation

Analyzes long audio and generates structured summaries.

Saves the time otherwise spent reviewing recordings manually
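Recordings longer than the roughly 30 minutes Voxtral handles per transcription request have to be split before processing. The sketch below only plans the time windows; the `plan_chunks` helper is hypothetical, and cutting at fixed boundaries is a simplifying assumption (a real pipeline would probably cut at pauses so words are not split).

```python
from typing import List, Tuple


def plan_chunks(total_seconds: float, max_chunk_seconds: float = 30 * 60) -> List[Tuple[float, float]]:
    """Split [0, total_seconds) into consecutive windows no longer than max_chunk_seconds."""
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("total_seconds must be >= 0 and max_chunk_seconds must be > 0")
    total = float(total_seconds)
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < total:
        end = min(start + float(max_chunk_seconds), total)
        chunks.append((start, end))
        start = end
    return chunks


# A 75-minute meeting becomes two full 30-minute windows plus a 15-minute remainder.
for start, end in plan_chunks(75 * 60):
    print(f"{start / 60:.0f} min to {end / 60:.0f} min")
```

Each window can then be cut from the source file with an audio tool of your choice and transcribed separately.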

🚀 Voxtral Mini 3B-2507 (Transformers Edition)

Voxtral Mini enhances Ministral 3B by integrating cutting-edge audio input capabilities while maintaining top-notch text performance. It shines in speech transcription, translation, and audio understanding.

Learn more about Voxtral in our blog post here.

✨ Features

Voxtral builds upon Ministral-3B with powerful audio understanding capabilities:

Dedicated transcription mode: Voxtral can operate in a pure speech transcription mode to maximize performance. By default, it automatically predicts the source audio language and transcribes the text accordingly.
Long-form context: With a 32k token context length, Voxtral can handle audio up to 30 minutes long for transcription or 40 minutes long for understanding.
Built-in Q&A and summarization: Supports asking questions directly through audio. It can analyze audio and generate structured summaries without the need for separate ASR and language models.
Natively multilingual: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
Function-calling straight from voice: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
Highly capable at text: Retains the text understanding capabilities of its language model backbone, Ministral-3B.
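The relationship between the 32k token context and the 30 and 40 minute audio limits can be checked with simple arithmetic. The sketch below assumes audio is encoded at about 12.5 tokens per second; that rate is not stated in this document and should be treated as an assumption, as should reading "32k" as 32,000.

```python
CONTEXT_TOKENS = 32_000          # "32k" from the feature list, read as 32,000 here
AUDIO_TOKENS_PER_SECOND = 12.5   # ASSUMPTION: not stated in this document


def audio_tokens(minutes: float, tokens_per_second: float = AUDIO_TOKENS_PER_SECOND) -> int:
    """Estimate how many context tokens a recording of the given length consumes."""
    return int(round(minutes * 60 * tokens_per_second))


for minutes in (30, 40):
    used = audio_tokens(minutes)
    print(f"{minutes} min -> {used} audio tokens, {CONTEXT_TOKENS - used} left for text")
```

Under these assumptions, 40 minutes of audio leaves only a small budget for the prompt and answer, while 30 minutes leaves more room for generated text. That would be consistent with the lower limit for transcription, where the output grows with the length of the audio.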

📈 Benchmark Results

Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

[Figure: average WER across FLEURS, Mozilla Common Voice, and Multilingual LibriSpeech]
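For readers unfamiliar with the metric, word error rate is the word-level edit distance between the reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal implementation of that definition; it does not reproduce the text normalization or scoring tools used for the benchmark numbers reported here, which this document does not describe.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("a") and one substitution ("lamb" -> "lamp") over 5 reference words.
print(word_error_rate("mary had a little lamb", "mary had little lamp"))
```

Lower is better, and values above 1.0 are possible when the hypothesis contains many inserted words.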

Text

[Figure: text benchmark results]

📦 Installation

Install Transformers from source:

pip install git+https://github.com/huggingface/transformers

💻 Usage Examples

Basic Usage

The model can be used with the following frameworks:

Transformers 🤗: See here

Notes:

Use temperature = 0.2 and top_p = 0.95 for chat completion (e.g. audio understanding) and temperature = 0.0 for transcription.
Multiple audios per message and multiple user turns with audio are supported.
System prompts are not yet supported.
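One way to apply the recommended sampling settings with Transformers is to pass them as keyword arguments to `model.generate`. The sketch below builds those arguments with a hypothetical `generation_kwargs` helper. It assumes that temperature = 0.0 is best expressed as greedy decoding (`do_sample=False`), since sampling with a temperature of zero is not meaningful.

```python
from typing import Any, Dict


def generation_kwargs(task: str, max_new_tokens: int = 500) -> Dict[str, Any]:
    """Return keyword arguments for model.generate matching the notes above."""
    if task == "chat":
        # Sampling with the recommended temperature and nucleus threshold.
        return {
            "do_sample": True,
            "temperature": 0.2,
            "top_p": 0.95,
            "max_new_tokens": max_new_tokens,
        }
    if task == "transcription":
        # Greedy decoding stands in for temperature = 0.0.
        return {"do_sample": False, "max_new_tokens": max_new_tokens}
    raise ValueError(f"Unknown task: {task!r}")


print(generation_kwargs("chat"))
print(generation_kwargs("transcription"))

# Usage with a loaded model (not run here):
# outputs = model.generate(**inputs, **generation_kwargs("chat"))
```

The examples in this document call `model.generate` with only `max_new_tokens`, so they rely on the model's default generation configuration instead.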

Advanced Usage

Transformers 🤗

Voxtral is supported in Transformers natively!

Audio Instruct

➡️ multi-audio + text instruction

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
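The examples in this section decode `outputs[:, inputs.input_ids.shape[1]:]` and not the full output, because `generate` returns the prompt tokens followed by the newly generated ones. The sketch below shows the same slicing on plain Python lists with made-up token ids; it assumes every prompt in the batch has been padded to the same length, so a single offset works for all rows.

```python
from typing import List


def strip_prompt(sequences: List[List[int]], prompt_length: int) -> List[List[int]]:
    """Drop the first prompt_length tokens of every sequence, keeping only new tokens."""
    return [sequence[prompt_length:] for sequence in sequences]


PAD = 0  # illustrative padding id
# Two prompts padded to the same length (token ids are made up).
prompt_batch = [[PAD, PAD, 11, 12], [21, 22, 23, 24]]
# What a generate call would return: each prompt followed by its new tokens.
generated = [[PAD, PAD, 11, 12, 101, 102], [21, 22, 23, 24, 201, 202]]

prompt_length = len(prompt_batch[0])
print(strip_prompt(generated, prompt_length))
```

Without this step, the decoded text would begin with the rendered prompt and not with the model's answer.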

➡️ multi-turn

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)

➡️ text only

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)

➡️ audio only

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)

➡️ batched inference

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)

Transcription

➡️ transcribe

from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "MohamedRashad/Voxtral-Mini-3B-2507-transformers"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

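# Note: `apply_transcrition_request` is spelled as in the original example; newer Transformers
# releases may expose a corrected spelling, so check the method name in your installed version.
inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)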
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)

📄 License

This project is licensed under the Apache-2.0 license.
