Ultravox Open-source Model for Multilingual Audio-to-Text - Supports Speech Recognition and Transcription in Multiple Languages

ultravox-v0_5-llama-3_2-1b-ONNX

Developed by onnx-community

Ultravox is a multilingual audio-to-text model built on Llama 3.2 1B. It supports speech recognition and transcription in multiple languages.

Audio-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Multilingual audio transcription #Real-time speech processing #Conversational AI integration

Downloads 1,088

Release Time: 2/19/2025

Model Overview

This model focuses on audio-to-text conversion tasks, capable of processing speech input in multiple languages and generating accurate text transcriptions.

Model Features

Multilingual support

Supports audio transcription in 42 languages, including European, Asian, and African languages. The full list of language codes appears in the property table under Documentation.

Efficient quantization

Provides several reduced-precision options (fp16, q8, q4, q4f16), selectable per model component, reducing model size and computational requirements while aiming to preserve accuracy.
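
As an illustration of per-component precision, the sketch below maps a preset name to a `dtype` object of the shape passed to `from_pretrained` in the Basic Usage example. The preset names and the `dtypeFor` helper are hypothetical and not part of Transformers.js; the component names and dtype values are taken from the comments in that example.

```javascript
// Hypothetical presets. Each value is one of the dtypes that the usage
// example's comments list for that component.
const DTYPE_PRESETS = {
  // Smallest download: 4-bit weights wherever the example lists them.
  small: { embed_tokens: "q8", audio_encoder: "q4", decoder_model_merged: "q4" },
  // Middle ground: 8-bit weights for every component.
  balanced: { embed_tokens: "q8", audio_encoder: "q8", decoder_model_merged: "q8" },
  // Highest precision the example lists for each component.
  precise: { embed_tokens: "fp32", audio_encoder: "fp32", decoder_model_merged: "q8" },
};

// Returns a fresh dtype object for the given preset name.
function dtypeFor(preset) {
  const dtype = DTYPE_PRESETS[preset];
  if (!dtype) {
    const known = Object.keys(DTYPE_PRESETS).join(", ");
    throw new Error(`Unknown preset "${preset}"; expected one of: ${known}`);
  }
  return { ...dtype }; // copy so callers cannot mutate the preset table
}
```

The result would be passed as `{ dtype: dtypeFor("small") }` in the second argument of `UltravoxModel.from_pretrained`.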

Conversational transcription

Capable of understanding context and generating transcription results suitable for conversational scenarios, not just word-for-word transcription.
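
Because the model replies conversationally, the prompt influences how much commentary surrounds the transcript. The sketch below builds chat messages that ask for a verbatim transcript. The helper name and the instruction wording are assumptions and are not guaranteed to suppress commentary; the message roles and the `<|audio|>` placeholder follow the Basic Usage example.

```javascript
// Hypothetical helper: builds chat messages that request a verbatim transcript.
// The default instruction text is an assumption and may need tuning.
function buildTranscriptionMessages(
  instruction = "Transcribe the audio verbatim. Output only the transcript, with no commentary.",
) {
  return [
    { role: "system", content: instruction },
    // "<|audio|>" is the placeholder used in the usage example's user turn.
    { role: "user", content: "Transcribe this audio:<|audio|>" },
  ];
}
```

The returned array can be passed to `processor.tokenizer.apply_chat_template` in place of the `messages` constant from the usage example.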

Model Capabilities

Audio transcription

Multilingual speech recognition

Conversational text generation

Real-time speech processing

Use Cases

Meeting minutes

Multilingual meeting transcription

Automatically transcribes multilingual meeting recordings into text, supporting subsequent translation and analysis.

Accurately identifies speech content from different speakers and generates structured meeting minutes.
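
For long meeting recordings, one common approach is to split the audio into shorter windows and transcribe each window separately. The sketch below assumes 16 kHz mono samples in a `Float32Array` (the rate requested from `read_audio` in the Basic Usage example) and a hypothetical 30-second window. The model card does not state a maximum input length, so the window size is an assumption, and `chunkAudio` is not a Transformers.js function.

```javascript
// Splits mono PCM samples into fixed-length windows.
// sampleRate and chunkSeconds defaults are assumptions for illustration.
function chunkAudio(samples, sampleRate = 16000, chunkSeconds = 30) {
  const chunkSize = Math.floor(sampleRate * chunkSeconds);
  if (!(chunkSize >= 1)) {
    throw new RangeError("sampleRate * chunkSeconds must be at least 1 sample");
  }
  const chunks = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    const end = Math.min(start + chunkSize, samples.length);
    chunks.push({
      startSeconds: start / sampleRate,
      endSeconds: end / sampleRate,
      samples: samples.subarray(start, end), // a view on the same buffer, no copy
    });
  }
  return chunks;
}
```

Each chunk's `samples` could then be passed to the processor in place of the full `audio` array. Cutting at fixed offsets can split words, so overlapping windows or silence-based cuts may work better in practice.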

Media production

Video subtitle generation

Automatically generates subtitles for multilingual video content.

Improves video accessibility and reduces manual subtitle production costs.
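
Subtitle files need timing information, which the usage example on this page does not produce. The sketch below formats already-timed text segments as SubRip (SRT) cues, assuming each segment's start and end times are known from elsewhere, for example from how the audio was split before transcription. The function names and the segment shape are hypothetical.

```javascript
// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function toSrtTimestamp(seconds) {
  const totalMs = Math.round(seconds * 1000);
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(ms, 3)}`;
}

// Converts [{ startSeconds, endSeconds, text }, ...] into SRT text.
// Cues are numbered from 1 and separated by a blank line.
function toSrt(segments) {
  return segments
    .map((seg, i) => {
      const timing = `${toSrtTimestamp(seg.startSeconds)} --> ${toSrtTimestamp(seg.endSeconds)}`;
      return `${i + 1}\n${timing}\n${seg.text.trim()}\n`;
    })
    .join("\n");
}
```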

Customer service

Voice customer service recording

Automatically records and analyzes customer service call content.

Facilitates quality monitoring and customer needs analysis.

🚀 Transformers.js

Transformers.js is a JavaScript library that enables you to use pre-trained models for various NLP and audio-text tasks. It provides an easy-to-use interface for tasks like audio-text transcription.

🚀 Quick Start

To get started with Transformers.js, you first need to install the library. You can install it from NPM using the following command:

npm i @huggingface/transformers

💻 Usage Examples

Basic Usage

Once the library is installed, you can use the model as shown in the following example:

import { UltravoxProcessor, UltravoxModel, read_audio } from "@huggingface/transformers";

const processor = await UltravoxProcessor.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
);
const model = await UltravoxModel.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
  {
    dtype: {
      embed_tokens: "q8", // "fp32", "fp16", "q8"
      audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
      decoder_model_merged: "q4", // "q8", "q4", "q4f16"
    },
  },
);

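// Fetch and decode the sample clip, resampled to 16 kHz.
const audio = await read_audio("http://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
const messages = [
  {
    role: "system",
    content: "You are a helpful assistant.",
  },
  { role: "user", content: "Transcribe this audio:<|audio|>" },
];
// Render the chat messages into a single prompt string.
// The <|audio|> placeholder marks where the audio is inserted.
const text = processor.tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  tokenize: false,
});

// Combine the prompt and audio into model inputs, then generate up to 128 new tokens.
const inputs = await processor(text, audio);
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode only the newly generated tokens by slicing off the prompt portion of each sequence.
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);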
// "I can transcribe the audio for you. Here's the transcription:\n\n\"I have a dream that one day this nation will rise up and live out the true meaning of its creed.\"\n\n- Martin Luther King Jr.\n\nWould you like me to provide the transcription in a specific format (e.g., word-for-word, character-for-character, or a specific font)?"
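
In the example above, `generated_ids` holds the prompt tokens followed by the newly generated tokens. The call `slice(null, [inputs.input_ids.dims.at(-1), null])` keeps every sequence in the batch but drops the first `inputs.input_ids.dims.at(-1)` positions, which is the prompt length. The same idea on plain arrays, as a minimal sketch (the `stripPrompt` helper is hypothetical and operates on nested arrays rather than tensors):

```javascript
// Plain-array analogue of slicing the prompt off each generated sequence.
// sequences: array of token-id arrays; promptLength: number of prompt tokens.
function stripPrompt(sequences, promptLength) {
  return sequences.map((ids) => ids.slice(promptLength));
}
```

Without this step, the decoded text would begin with the rendered chat prompt rather than with the model's reply.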

📄 License

This project is licensed under the MIT license.

📚 Documentation

Property	Details
Supported Languages	ar, be, bg, bn, cs, cy, da, de, el, en, es, et, fa, fi, fr, gl, hi, hu, it, ja, ka, lt, lv, mk, mr, nl, pl, pt, ro, ru, sk, sl, sr, sv, sw, ta, th, tr, uk, ur, vi, zh
Metrics	bleu
Pipeline Tag	audio-text-to-text
Base Model	fixie-ai/ultravox-v0_5-llama-3_2-1b
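
An application may want to check a language tag against the supported list before sending audio for transcription. The sketch below copies the codes from the table above; the `isSupportedLanguage` helper and its handling of region subtags are assumptions for illustration.

```javascript
// Language codes copied from the Supported Languages row of the table.
const SUPPORTED_LANGUAGES = new Set(
  (
    "ar be bg bn cs cy da de el en es et fa fi fr gl hi hu it ja ka " +
    "lt lv mk mr nl pl pt ro ru sk sl sr sv sw ta th tr uk ur vi zh"
  ).split(" "),
);

// Accepts tags such as "pt-BR" or "zh_CN" by comparing only the primary subtag.
function isSupportedLanguage(tag) {
  const primary = String(tag).trim().toLowerCase().split(/[-_]/)[0];
  return SUPPORTED_LANGUAGES.has(primary);
}
```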
