OmniAudio-2.6B Open-Source Audio Language Model - Efficient Deployment on the Edge, Supports Text and Audio Input

OmniAudio 2.6B

Developed by NexaAIDev

The world's fastest and most efficient edge-deployable audio language model, a 2.6B parameter multimodal model capable of processing both text and audio inputs.

Audio-to-Text | English | Open Source License: Apache-2.0 | #Edge-side audio processing #Low-latency dialogue #Offline voice Q&A

Downloads 1,149

Release Time: 12/11/2024

Model Overview

OmniAudio-2.6B is an efficient multimodal model that integrates Gemma-2-2b, Whisper turbo, and custom projection modules, enabling secure and responsive audio-text processing directly on edge devices.

Model Features

Edge-optimized deployment

Specially optimized for edge devices to achieve minimal latency and resource overhead.

Unified multimodal architecture

Integrates ASR and LLM capabilities within a single architecture, avoiding performance bottlenecks of traditional cascaded solutions.

Exceptional inference speed

Decodes 5.5x to 10.3x faster than Qwen2-Audio-7B-Instruct on consumer-grade hardware (measured on a 2024 Mac Mini M4 Pro).

Model Capabilities

Audio-text conversion

Voice dialogue

Creative content generation

Audio summarization

Voice tone adjustment

Use Cases

Offline voice interaction

Offline queries: Processes voice queries in no-network environments, such as asking how to start a fire while camping, and provides practical guidance.

Voice assistant

Emotional support dialogue: Offers supportive responses to users' expressed emotions through active listening and response.

Content creation

Voice-to-poetry: Transforms voice prompts into creative works, generating poetic responses.

Office productivity

Meeting recording summaries: Converts lengthy recordings into concise, actionable summaries.

🚀 OmniAudio-2.6B

OmniAudio is the world's fastest and most efficient audio-language model for on-device deployment: a 2.6B-parameter multimodal model that processes both text and audio inputs. It integrates three components, Gemma-2-2b, Whisper turbo, and a custom projector module, enabling secure, responsive audio-text processing directly on edge devices.

Unlike traditional approaches that chain ASR and LLM models together, OmniAudio-2.6B unifies both capabilities in a single efficient architecture for minimal latency and resource overhead.
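
To make the contrast with a cascaded ASR-then-LLM pipeline more concrete, the toy sketch below shows the general idea of a projector module: frames from an audio encoder are mapped by a linear projection into the language model's embedding space and placed in the same input sequence as text token embeddings, so no intermediate transcript is needed. The dimensions, the random weights, and the names `project_audio` and `build_input_sequence` are illustrative assumptions, not OmniAudio's actual implementation.

```python
import random

# Toy sizes chosen for readability; the real encoder and LLM widths differ.
AUDIO_DIM = 4  # hypothetical width of one audio-encoder frame
LLM_DIM = 6    # hypothetical width of one LLM token embedding

rng = random.Random(0)
# Hypothetical projector weights: one row per LLM embedding dimension.
PROJECTOR = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(LLM_DIM)]


def project_audio(frames):
    """Map each audio frame (length AUDIO_DIM) to an LLM-sized vector."""
    return [
        [sum(w * x for w, x in zip(row, frame)) for row in PROJECTOR]
        for frame in frames
    ]


def build_input_sequence(text_embeddings, audio_frames):
    """Concatenate text embeddings with projected audio embeddings."""
    return list(text_embeddings) + project_audio(audio_frames)


# Three fake audio frames and two fake text token embeddings.
audio_frames = [[rng.uniform(-1, 1) for _ in range(AUDIO_DIM)] for _ in range(3)]
text_embeddings = [[0.0] * LLM_DIM for _ in range(2)]

sequence = build_input_sequence(text_embeddings, audio_frames)
print(len(sequence), len(sequence[0]))  # sequence length and embedding width
```

In a cascaded design the audio would first be decoded to text and then re-embedded by the LLM; in this sketch the language model would consume the projected audio vectors directly.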

🚀 Quick Start

Quick Links

Interactive Demo in our [HuggingFace Space](https://huggingface.co/spaces/NexaAIDev/omni-audio-demo)
[Quickstart for local setup](#how-to-use-on-device)
Learn more in our [Blogs](https://nexa.ai/blogs/OmniAudio-2.6B)

Feedback: Send questions or suggestions about the model in our [Discord](https://discord.gg/nexa-ai)

✨ Features

Performance Benchmarks on Consumer Hardware

On a 2024 Mac Mini M4 Pro, Qwen2-Audio-7B-Instruct running on 🤗 Transformers achieves an average decoding speed of 6.38 tokens/second, while OmniAudio-2.6B through Nexa SDK reaches 35.23 tokens/second in the FP16 GGUF version and 66 tokens/second in the Q4_K_M quantized GGUF version, which is 5.5x to 10.3x faster on consumer hardware.
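
The 5.5x and 10.3x figures follow directly from the quoted decoding speeds. The short sketch below reproduces that arithmetic and also estimates how long a reply would take at each speed. The 200-token reply length is an illustrative assumption, not a figure from the benchmark.

```python
# Decoding speeds quoted above, in tokens per second.
BASELINE_TPS = 6.38    # Qwen2-Audio-7B-Instruct on Transformers
OMNI_FP16_TPS = 35.23  # OmniAudio-2.6B, FP16 GGUF
OMNI_Q4_TPS = 66.0     # OmniAudio-2.6B, Q4_K_M GGUF


def speedup(candidate_tps, baseline_tps):
    """Return how many times faster the candidate decodes than the baseline."""
    return candidate_tps / baseline_tps


def seconds_to_decode(num_tokens, tps):
    """Return the time needed to decode num_tokens at a given speed."""
    return num_tokens / tps


print(f"FP16 speedup:   {speedup(OMNI_FP16_TPS, BASELINE_TPS):.1f}x")  # 5.5x
print(f"Q4_K_M speedup: {speedup(OMNI_Q4_TPS, BASELINE_TPS):.1f}x")    # 10.3x

REPLY_TOKENS = 200  # hypothetical reply length, for illustration only
for name, tps in [("baseline", BASELINE_TPS), ("FP16", OMNI_FP16_TPS), ("Q4_K_M", OMNI_Q4_TPS)]:
    print(f"{name}: {seconds_to_decode(REPLY_TOKENS, tps):.1f} s for {REPLY_TOKENS} tokens")
```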

Use Cases

Voice QA without Internet: Process offline voice queries like "I am camping, how do I start a fire without a fire starter?" OmniAudio provides practical guidance even without network connectivity.
Voice-in Conversation: Have conversations about personal experiences. When you say "I am having a rough day at work," OmniAudio engages in supportive talk and active listening.
Creative Content Generation: Transform voice prompts into creative pieces. Ask "Write a haiku about autumn leaves" and receive poetic responses inspired by your voice input.
Recording Summary: Simply ask "Can you summarize this meeting note?" to convert lengthy recordings into concise, actionable summaries.
Voice Tone Modification: Transform casual voice memos into professional communications. When you request "Can you make this voice memo more professional?" OmniAudio adjusts the tone while preserving the core message.

📦 Installation

How to Use On Device

Step 1: Install Nexa-SDK (local on-device inference framework)

[🚀 Install Nexa-SDK](https://github.com/NexaAI/nexa-sdk?tab=readme-ov-file#install-option-1-executable-installer)

Nexa-SDK is an open-source, local on-device inference framework that supports text generation, image generation, vision-language models (VLM), audio-language models, speech-to-text (ASR), and text-to-speech (TTS). It can be installed via the Python package or the executable installer.

Step 2: Run the following command in your terminal

nexa run omniaudio -st

💻 The OmniAudio-2.6B Q4_K_M version requires 1.30GB RAM and 1.60GB storage space.
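
Before downloading the model, it can help to confirm that the machine meets these requirements. The sketch below is a minimal pre-flight check using only the Python standard library. It assumes that "GB" means decimal gigabytes, and it takes free RAM as a value you supply, because the standard library has no portable way to query it. The helper names are hypothetical and are not part of Nexa-SDK.

```python
import shutil

REQUIRED_RAM_GB = 1.30   # RAM requirement stated above
REQUIRED_DISK_GB = 1.60  # storage requirement stated above
BYTES_PER_GB = 1_000_000_000  # assumption: decimal gigabytes


def meets_requirements(free_ram_gb, free_disk_gb):
    """Return True if both the RAM and storage requirements are satisfied."""
    return free_ram_gb >= REQUIRED_RAM_GB and free_disk_gb >= REQUIRED_DISK_GB


def free_disk_gb(path="."):
    """Return free disk space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / BYTES_PER_GB


# Hypothetical value: replace with the free RAM reported by your OS tools.
free_ram_gb = 8.0

print("nexa CLI found on PATH:", shutil.which("nexa") is not None)
print("Free disk (GB):", round(free_disk_gb(), 2))
print("Requirements met:", meets_requirements(free_ram_gb, free_disk_gb()))
```

If the check passes and the `nexa` command is on your PATH, the command from Step 2 can be run as shown.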

🔧 Technical Details

Training

We developed OmniAudio through a three-stage training pipeline:

Pretraining: The initial stage focuses on core audio-text alignment using the MLS English 10k transcription dataset. We introduced a special <|transcribe|> token to enable the model to distinguish between transcription and completion tasks, ensuring consistent performance across use cases.
Supervised Fine-Tuning (SFT): We enhance the model's conversation capabilities using synthetic datasets derived from MLS English 10k transcription. This stage leverages a proprietary model to generate contextually appropriate responses, creating rich audio-text pairs for effective dialogue understanding.
Direct Preference Optimization (DPO): The final stage refines model quality using the GPT-4o API as a reference. The process identifies and corrects inaccurate responses while maintaining semantic alignment. We additionally leverage Gemma2's text responses as a gold standard to ensure consistent quality across both audio and text inputs.
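
Two ideas from this pipeline can be sketched in a few lines: how a task token such as <|transcribe|> could separate transcription examples from completion examples, and how a DPO loss scores one preference pair. The prompt layout, the `<audio>` placeholder, and the function names are hypothetical. The loss is the standard DPO objective, shown as an illustration and not as the exact training code used for OmniAudio.

```python
import math

TRANSCRIBE_TOKEN = "<|transcribe|>"  # special token named in the text above
AUDIO_PLACEHOLDER = "<audio>"        # hypothetical stand-in for audio embeddings


def build_example(transcript, response=None):
    """Build one (prompt, target) training pair.

    Without a response, the task token marks a transcription example and the
    target is the transcript. With a response, the token is omitted and the
    target is the conversational reply (a completion example).
    """
    if response is None:
        return AUDIO_PLACEHOLDER + TRANSCRIBE_TOKEN, transcript
    return AUDIO_PLACEHOLDER, response


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen reference.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # -log(sigmoid(beta * margin)), written as log(1 + exp(-beta * margin)).
    return math.log1p(math.exp(-beta * margin))


print(build_example("how do I start a fire"))
print(build_example("how do I start a fire", "Gather dry tinder first."))
# The loss is lower when the policy prefers the chosen response.
print(dpo_loss(-1.0, -5.0, -2.0, -2.0) < dpo_loss(-5.0, -1.0, -2.0, -2.0))
```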

📚 Documentation

What's Next for OmniAudio?

OmniAudio is in active development and we are working to advance its capabilities:

Building direct audio generation for two-way voice communication
Implementing function calling support via [Octopus_v2](https://huggingface.co/NexaAIDev/Octopus-v2) integration

In the long term, we aim to establish OmniAudio as a comprehensive solution for edge-based audio-language processing.

Join Community

[Discord](https://discord.gg/nexa-ai) | X (Twitter)

📄 License

This project is licensed under the Apache-2.0 license.
