🚀 Kimi-Audio
Kimi-Audio is an open-source audio foundation model that excels in audio understanding, generation, and conversation, offering a unified framework for a wide range of audio processing tasks.
🚀 Quick Start
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn using the Kimi-Audio-7B-Instruct model.
```python
import soundfile as sf
import torch

from kimia_infer.api.kimia import KimiAudio

model_id = "moonshotai/Kimi-Audio-7B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model; loading directly from the HF Hub may require extra setup.
try:
    model = KimiAudio(model_path=model_id, load_detokenizer=True)
    model.to(device)
except Exception as e:
    print("Automatic loading from the HF Hub might require specific setup.")
    print(f"Refer to the Kimi-Audio docs, or load from a local path instead. Error: {e}")
    raise

# Sampling parameters for text and audio token generation.
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# Paths to example audio files (replace with your own recordings).
asr_audio_path = "asr_example.wav"
qa_audio_path = "qa_example.wav"

# --- Example 1: audio-to-text (ASR) ---
messages_asr = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": asr_audio_path},
]
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)

# --- Example 2: conversational turn producing both text and speech ---
messages_conversation = [
    {"role": "user", "message_type": "audio", "content": qa_audio_path},
]
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated speech; the detokenizer produces 24 kHz audio.
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)

print("Kimi-Audio inference examples complete.")
```
✨ Features
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository hosts the model checkpoints for Kimi-Audio-7B-Instruct.
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
- Universal Capabilities: Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
- State-of-the-Art Performance: Achieves SOTA results on numerous audio benchmarks (see our Technical Report).
- Large-Scale Pre-training: Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
- Novel Architecture: Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation (a toy sketch of this idea follows the list).
- Efficient Inference: Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
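The parallel-heads design can be pictured as a shared transformer trunk whose hidden states feed two output projections, one over the text vocabulary and one over the discrete audio-token vocabulary. The snippet below is a toy illustration of that idea only, not Kimi-Audio's actual implementation; the class name, layer counts, and vocabulary sizes are all invented for the sketch.

```python
import torch
import torch.nn as nn

class ParallelHeadsToy(nn.Module):
    """Toy sketch: one shared trunk, two parallel output heads."""

    def __init__(self, hidden=1024, text_vocab=32000, audio_vocab=8192):
        super().__init__()
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Two parallel projections over the same hidden states:
        self.text_head = nn.Linear(hidden, text_vocab)    # predicts next text token
        self.audio_head = nn.Linear(hidden, audio_vocab)  # predicts next audio token

    def forward(self, x):
        h = self.trunk(x)  # (batch, seq, hidden)
        return self.text_head(h), self.audio_head(h)

logits_text, logits_audio = ParallelHeadsToy()(torch.randn(1, 16, 1024))
print(logits_text.shape, logits_audio.shape)  # both heads fire per position
```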
For more details, please refer to our GitHub Repository and Technical Report.
📦 Installation
We recommend building a Docker image to run inference. After cloning the inference code, you can build the image with the docker build command:
```bash
git clone https://github.com/MoonshotAI/Kimi-Audio
cd Kimi-Audio
git submodule update --init
docker build -t kimi-audio:v0.1 .
```
Alternatively, you can use our pre-built image:
```bash
docker pull moonshotai/kimi-audio:v0.1
```
Or, you can install the requirements directly:
```bash
pip install -r requirements.txt
```
You may refer to the Dockerfile in case of any environment issues.
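If you install the requirements manually, it can be worth confirming that PyTorch was built with CUDA support before loading the 7B checkpoint, since the Quick Start above moves the model to a CUDA device when one is available:

```python
import torch

# Quick environment check before loading the 7B checkpoint.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```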
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Kimi-Audio is an open-source audio foundation model. |
| Training Data | Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data. |
Links
🤗 Kimi-Audio-7B | 🤗 Kimi-Audio-7B-Instruct | 📑 Paper
📄 License
The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache 2.0 License; other parts of the code are licensed under the MIT License.
📖 Citation
If you find Kimi-Audio useful in your research or applications, please cite our technical report:
```bibtex
@misc{kimi_audio_2024,
    title={Kimi-Audio Technical Report},
    author={Kimi Team},
    year={2024},
    eprint={arXiv:placeholder},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```