# 🚀 Shuka v1: An Audio-Understanding Language Model
Shuka v1 is a language model with the native ability to understand audio in Indic languages. It combines two key models into an encoder-decoder architecture:

- Our in-house audio encoder, Saaras v1.
- Meta's Llama3-8B-Instruct as the decoder.

A small projector of approximately 60M parameters connects the encoder and the decoder. During training, only the projector weights are fine-tuned, while the rest of the network remains frozen. True to our tradition of cost-effective model training, we trained Shuka v1 on less than 100 hours of audio.

Even though the projector is fine-tuned only on English and Hindi data, the multilingual nature of our encoder enables Shuka v1 to perform well on zero-shot QA in other Indic languages. We have tested the model on Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
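The projector described above can be sketched as a small feed-forward module that maps per-frame encoder embeddings into the decoder's token-embedding space. The dimensions, layer shapes, and activation below are illustrative assumptions for intuition only, not the actual Shuka v1 configuration:

```python
import numpy as np

# Illustrative (assumed) dimensions -- not the actual Shuka v1 config.
ENCODER_DIM = 1280   # per-frame embedding size from the audio encoder
HIDDEN_DIM = 4096    # projector hidden width
DECODER_DIM = 4096   # token embedding size of the Llama3-8B decoder

rng = np.random.default_rng(0)

# A two-layer MLP projector: these are the only weights that would be
# fine-tuned; the encoder and decoder stay frozen.
W1 = rng.standard_normal((ENCODER_DIM, HIDDEN_DIM)) * 0.02
W2 = rng.standard_normal((HIDDEN_DIM, DECODER_DIM)) * 0.02

def project(frames: np.ndarray) -> np.ndarray:
    """Map (num_frames, ENCODER_DIM) audio features to decoder space."""
    hidden = np.maximum(frames @ W1, 0.0)  # ReLU non-linearity
    return hidden @ W2                     # (num_frames, DECODER_DIM)

audio_features = rng.standard_normal((50, ENCODER_DIM))
tokens = project(audio_features)
print(tokens.shape)  # (50, 4096)
```

The projected frames can then be interleaved with text token embeddings in the decoder's input sequence, which is what lets a frozen text LLM attend to audio.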
## 🚀 Quick Start
See what Shuka v1 can do in this demo video. You can get started with the Hugging Face pipeline as follows:

```python
import librosa
import transformers

# Load the model as an audio-text-to-text pipeline.
pipe = transformers.pipeline(
    model='sarvamai/shuka_v1',
    trust_remote_code=True,
    device=0,
    torch_dtype='bfloat16',
)

# The model expects 16 kHz audio.
audio, sr = librosa.load("./hi-question.webm", sr=16000)

turns = [
    {'role': 'system', 'content': 'Respond naturally and informatively.'},
    {'role': 'user', 'content': '<|audio|>'},
]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=512)
```
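If you don't have an audio file handy, the input dictionary the pipeline expects can be assembled from any 16 kHz mono float waveform. Below is a sketch using a synthetic sine tone in place of real speech; the actual pipeline call is commented out since it requires downloading the model weights:

```python
import numpy as np

SAMPLE_RATE = 16000  # Shuka v1's encoder expects 16 kHz audio

# One second of a 440 Hz sine tone as a stand-in for real speech.
t = np.linspace(0, 1.0, SAMPLE_RATE, endpoint=False)
audio = 0.1 * np.sin(2 * np.pi * 440 * t).astype(np.float32)

turns = [
    {'role': 'system', 'content': 'Respond naturally and informatively.'},
    # The <|audio|> placeholder marks where the audio is injected.
    {'role': 'user', 'content': '<|audio|>'},
]

# This is the dict you would pass to the pipeline:
# pipe(inputs, max_new_tokens=512)
inputs = {'audio': audio, 'turns': turns, 'sampling_rate': SAMPLE_RATE}
print(len(inputs['audio']), inputs['sampling_rate'])  # 16000 16000
```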
## ✨ Features
- Native Audio Understanding: Shuka v1 can natively understand audio in Indic languages.
- Encoder-Decoder Architecture: Built by combining the Saaras v1 audio encoder and the Llama3-8B-Instruct decoder.
- Frugal Training: Trained on less than 100 hours of audio, with only the projector weights fine-tuned.
- Multilingual Performance: Performs well on zero-shot QA across multiple Indic languages.
## 📦 Installation

To use Shuka v1, you need to install the following libraries:

```bash
pip install transformers==4.41.2 peft==0.11.1 librosa==0.10.2
```
## 📚 Documentation
For more details, please see our blog.
## 📄 License

This project is licensed under the llama3 license.
## 📋 Information Table

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Pipeline Tag | audio-text-to-text |
| Model Type | Encoder-decoder (combination of Saaras v1 and Llama3-8B-Instruct) |
| Training Data | Less than 100 hours of audio, fine-tuned on English and Hindi data |
| Supported Languages | Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu |
| License | llama3 |