🚀 Kimi-Audio
Kimi-Audio is an open-source audio foundation model that excels in audio understanding, generation, and conversation, offering a unified framework for a wide range of audio processing tasks.
🚀 Quick Start
This example demonstrates basic usage for generating text from audio (ASR) and generating both text and speech in a conversational turn using the Kimi-Audio-7B-Instruct model.
```python
import soundfile as sf
import torch

from kimia_infer.api.kimia import KimiAudio

model_id = "moonshotai/Kimi-Audio-7B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model; loading directly from the HF Hub may require extra setup.
try:
    model = KimiAudio(model_path=model_id, load_detokenizer=True)
    model.to(device)
except Exception as e:
    print("Automatic loading from the HF Hub might require specific setup.")
    print(f"Refer to the Kimi-Audio docs, or load from a local path instead. Error: {e}")
    raise

# Sampling parameters for text and audio token generation.
sampling_params = {
    "audio_temperature": 0.8,
    "audio_top_k": 10,
    "text_temperature": 0.0,
    "text_top_k": 5,
    "audio_repetition_penalty": 1.0,
    "audio_repetition_window_size": 64,
    "text_repetition_penalty": 1.0,
    "text_repetition_window_size": 16,
}

# Paths to example audio files (replace with your own recordings).
asr_audio_path = "asr_example.wav"
qa_audio_path = "qa_example.wav"

# --- Example 1: audio-to-text (ASR) ---
messages_asr = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": asr_audio_path},
]
_, text_output = model.generate(messages_asr, **sampling_params, output_type="text")
print(">>> ASR Output Text: ", text_output)

# --- Example 2: conversational turn producing both text and speech ---
messages_conversation = [
    {"role": "user", "message_type": "audio", "content": qa_audio_path},
]
wav_output, text_output = model.generate(messages_conversation, **sampling_params, output_type="both")

# Save the generated speech; the detokenizer produces 24 kHz audio.
output_audio_path = "output_audio.wav"
sf.write(output_audio_path, wav_output.detach().cpu().view(-1).numpy(), 24000)
print(f">>> Conversational Output Audio saved to: {output_audio_path}")
print(">>> Conversational Output Text: ", text_output)

print("Kimi-Audio inference examples complete.")
```
✨ Features
We present Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation. This repository hosts the model checkpoints for Kimi-Audio-7B-Instruct.
Kimi-Audio is designed as a universal audio foundation model capable of handling a wide variety of audio processing tasks within a single unified framework. Key features include:
- Universal Capabilities: Handles diverse tasks like speech recognition (ASR), audio question answering (AQA), audio captioning (AAC), speech emotion recognition (SER), sound event/scene classification (SEC/ASC), and end-to-end speech conversation.
- State-of-the-Art Performance: Achieves SOTA results on numerous audio benchmarks (see our Technical Report).
- Large-Scale Pre-training: Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data.
- Novel Architecture: Employs a hybrid audio input (continuous acoustic + discrete semantic tokens) and an LLM core with parallel heads for text and audio token generation (a toy sketch of this idea follows the list).
- Efficient Inference: Features a chunk-wise streaming detokenizer based on flow matching for low-latency audio generation.
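The parallel-heads design can be pictured as a shared transformer trunk whose hidden states feed two output projections, one over the text vocabulary and one over the discrete audio-token vocabulary. The snippet below is a toy illustration of that idea only, not Kimi-Audio's actual implementation; the class name, layer counts, and vocabulary sizes are all invented for the sketch.

```python
import torch
import torch.nn as nn

class ParallelHeadsToy(nn.Module):
    """Toy sketch: one shared trunk, two parallel output heads."""

    def __init__(self, hidden=1024, text_vocab=32000, audio_vocab=8192):
        super().__init__()
        self.trunk = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Two parallel projections over the same hidden states:
        self.text_head = nn.Linear(hidden, text_vocab)    # predicts next text token
        self.audio_head = nn.Linear(hidden, audio_vocab)  # predicts next audio token

    def forward(self, x):
        h = self.trunk(x)  # (batch, seq, hidden)
        return self.text_head(h), self.audio_head(h)

logits_text, logits_audio = ParallelHeadsToy()(torch.randn(1, 16, 1024))
print(logits_text.shape, logits_audio.shape)  # both heads fire per position
```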
For more details, please refer to our GitHub Repository and Technical Report.
📦 Installation
We recommend building a Docker image to run inference. After cloning the inference code, you can build the image with the docker build command:
```bash
git clone https://github.com/MoonshotAI/Kimi-Audio
cd Kimi-Audio
git submodule update --init
docker build -t kimi-audio:v0.1 .
```
Alternatively, you can use our pre-built image:
```bash
docker pull moonshotai/kimi-audio:v0.1
```
Or, you can install the requirements directly:
```bash
pip install -r requirements.txt
```
You may refer to the Dockerfile in case of any environment issues.
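If you install the requirements manually, it can be worth confirming that PyTorch was built with CUDA support before loading the 7B checkpoint, since the Quick Start above moves the model to a CUDA device when one is available:

```python
import torch

# Quick environment check before loading the 7B checkpoint.
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```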
📚 Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Kimi-Audio is an open-source audio foundation model. |
| Training Data | Pre-trained on over 13 million hours of diverse audio data (speech, music, sounds) and text data. |
Links
🤗 Kimi-Audio-7B | 🤗 Kimi-Audio-7B-Instruct | 📑 Paper
📄 License
The model is based on and modified from Qwen2.5-7B. Code derived from Qwen2.5-7B is licensed under the Apache 2.0 License; other parts of the code are licensed under the MIT License.
📖 Citation
If you find Kimi-Audio useful in your research or applications, please cite our technical report:
```bibtex
@misc{kimi_audio_2024,
    title={Kimi-Audio Technical Report},
    author={Kimi Team},
    year={2024},
    eprint={arXiv:placeholder},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```