# 🚀 Shuka v1: An Audio-Understanding Language Model
Shuka v1 is a language model with the native ability to understand audio in Indic languages. It combines two key models into an encoder-decoder architecture:

- Our in-house audio encoder, Saaras v1.
- Meta's Llama3-8B-Instruct as the decoder.

A small projector of approximately 60M parameters connects the encoder and the decoder. During training, only the projector weights are fine-tuned, while the rest of the network remains frozen. True to our tradition of cost-effective model training, we trained Shuka v1 on less than 100 hours of audio.

Even though the projector is fine-tuned only on English and Hindi data, the multilingual nature of our encoder enables Shuka v1 to perform well on zero-shot QA in other Indic languages. We have tested the model on Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
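The projector described above can be sketched as a small feed-forward module that maps per-frame encoder embeddings into the decoder's token-embedding space. The dimensions, layer shapes, and activation below are illustrative assumptions for intuition only, not the actual Shuka v1 configuration:

```python
import numpy as np

# Illustrative (assumed) dimensions -- not the actual Shuka v1 config.
ENCODER_DIM = 1280   # per-frame embedding size from the audio encoder
HIDDEN_DIM = 4096    # projector hidden width
DECODER_DIM = 4096   # token embedding size of the Llama3-8B decoder

rng = np.random.default_rng(0)

# A two-layer MLP projector: these are the only weights that would be
# fine-tuned; the encoder and decoder stay frozen.
W1 = rng.standard_normal((ENCODER_DIM, HIDDEN_DIM)) * 0.02
W2 = rng.standard_normal((HIDDEN_DIM, DECODER_DIM)) * 0.02

def project(frames: np.ndarray) -> np.ndarray:
    """Map (num_frames, ENCODER_DIM) audio features to decoder space."""
    hidden = np.maximum(frames @ W1, 0.0)  # ReLU non-linearity
    return hidden @ W2                     # (num_frames, DECODER_DIM)

audio_features = rng.standard_normal((50, ENCODER_DIM))
tokens = project(audio_features)
print(tokens.shape)  # (50, 4096)
```

The projected frames can then be interleaved with text token embeddings in the decoder's input sequence, which is what lets a frozen text LLM attend to audio.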
## 🚀 Quick Start
See what Shuka v1 can do in this demo video. You can get started with the Hugging Face pipeline as follows:

```python
import librosa
import transformers

# Load the model as an audio-text-to-text pipeline.
pipe = transformers.pipeline(
    model='sarvamai/shuka_v1',
    trust_remote_code=True,
    device=0,
    torch_dtype='bfloat16',
)

# The model expects 16 kHz audio.
audio, sr = librosa.load("./hi-question.webm", sr=16000)

turns = [
    {'role': 'system', 'content': 'Respond naturally and informatively.'},
    {'role': 'user', 'content': '<|audio|>'},
]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=512)
```
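If you don't have an audio file handy, the input dictionary the pipeline expects can be assembled from any 16 kHz mono float waveform. Below is a sketch using a synthetic sine tone in place of real speech; the actual pipeline call is commented out since it requires downloading the model weights:

```python
import numpy as np

SAMPLE_RATE = 16000  # Shuka v1's encoder expects 16 kHz audio

# One second of a 440 Hz sine tone as a stand-in for real speech.
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
audio = 0.1 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

turns = [
    {'role': 'system', 'content': 'Respond naturally and informatively.'},
    # The <|audio|> placeholder marks where the audio is injected.
    {'role': 'user', 'content': '<|audio|>'},
]

# This is the dict you would pass to the pipeline:
# pipe(inputs, max_new_tokens=512)
inputs = {'audio': audio, 'turns': turns, 'sampling_rate': SAMPLE_RATE}
print(len(inputs['audio']), inputs['sampling_rate'])  # 16000 16000
```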
## ✨ Features
- Native Audio Understanding: Shuka v1 can natively understand audio in Indic languages.
- Encoder-Decoder Architecture: Built by combining the Saaras v1 audio encoder and the Llama3-8B-Instruct decoder.
- Frugal Training: Trained on less than 100 hours of audio, with only the projector weights fine-tuned.
- Multilingual Performance: Performs well on zero-shot QA across multiple Indic languages.
## 📦 Installation

To use Shuka v1, you need to install the following libraries:

```bash
pip install transformers==4.41.2 peft==0.11.1 librosa==0.10.2
```
## 📚 Documentation
For more details, please see our blog.
## 📄 License

This project is licensed under the llama3 license.
## 📋 Information Table

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Pipeline Tag | audio-text-to-text |
| Model Type | Encoder-decoder (combination of Saaras v1 and Llama3-8B-Instruct) |
| Training Data | Less than 100 hours of audio, fine-tuned on English and Hindi data |
| Supported Languages | Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu |
| License | llama3 |