# SpeechLLM
SpeechLLM is a multi-modal large language model (LLM) designed to predict the metadata of a speaker's turn in a conversation. It combines a HubertX audio encoder with a TinyLlama LLM, enabling it to analyze audio signals and extract valuable information about the speaker.
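At a high level this follows the usual speech-LLM recipe: an audio encoder turns the waveform into frame-level embeddings, a small projector maps those into the LLM's embedding space, and the LLM decodes the requested metadata conditioned on both the audio embeddings and the text instruction. The sketch below only illustrates that fusion pattern; the dimensions, projector shape, and fusion details are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the real encoder and LLM define their own.
AUDIO_DIM, LLM_DIM = 1024, 2048

class AudioToLLMProjector(nn.Module):
    """Maps audio-encoder frame embeddings into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_frames)

# (batch, num_frames, AUDIO_DIM) frame embeddings from the audio encoder
audio_frames = torch.randn(1, 50, AUDIO_DIM)
# (batch, num_tokens, LLM_DIM) embeddings of the text instruction
text_embeds = torch.randn(1, 20, LLM_DIM)

projector = AudioToLLMProjector(AUDIO_DIM, LLM_DIM)
# Prepend the projected audio embeddings to the instruction embeddings and
# let the LLM generate the metadata from the combined sequence.
llm_inputs = torch.cat([projector(audio_frames), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 70, 2048])
```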
## Quick Start

## Links

## Model Image

## Model Capabilities
SpeechLLM can predict the following metadata from audio:
- SpeechActivity: Whether the audio signal contains speech (True/False)
- Transcript: The automatic speech recognition (ASR) transcript of the audio
- Gender of the speaker (Female/Male)
- Age of the speaker (Young/Middle-Age/Senior)
- Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
- Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Features
- Multi-modal: Combines audio encoding and language modeling for comprehensive speech analysis.
- Versatile Predictions: Provides a wide range of metadata about the speaker.
- Easy to Use: Can be loaded directly from Hugging Face and used with simple instructions.
## Installation

No separate installation step is required; the model loads directly through the Hugging Face `transformers` library, with `torchaudio` used for reading audio in the usage example below (e.g. `pip install transformers torchaudio`).
## Usage Examples

### Basic Usage
```python
import torchaudio
from transformers import AutoModel

# Load the model directly from Hugging Face
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    # torchaudio.load returns (waveform, sample_rate); index 0 is the waveform tensor
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)
```
Example output:

```
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America",
}
```
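The output above is a JSON-like string (note the trailing comma, which strict JSON parsers reject). If you want structured access to the individual fields, a small helper along these lines can convert it into a dict; the helper name and the assumption that `generate_meta` returns a plain string are illustrative, not part of the documented API.

```python
import json
import re

def parse_meta(raw: str) -> dict:
    """Parse the JSON-like metadata string into a dict.

    Removes a trailing comma before the closing brace, which the model's
    output may contain but strict JSON does not allow.
    """
    cleaned = re.sub(r",\s*}", "}", raw.strip())
    return json.loads(cleaned)

meta = parse_meta('{"SpeechActivity": "True", "Transcript": "Yes, I got it.",}')
print(meta["Transcript"])  # Yes, I got it.
```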
## Try it Out

You can try the model in the Google Colab notebook. Also, check out our blog on SpeechLLM for end-to-end conversational agents (User Speech -> Response).
## Documentation

### Model Details

| Property | Details |
|---|---|
| Developed by | Skit AI |
| Authors | Shangeth Rajaa, Abhinav Tushar |
| Language | English |
| Finetuned from model | HubertX, TinyLlama |
| Model Size | 2.1 B |
| Checkpoint | 2000 k steps (bs=1) |
| Adapters | r=4, alpha=8 |
| lr | 1e-4 |
| gradient accumulation steps | 8 |
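The adapter hyperparameters above (r=4, alpha=8) are LoRA-style settings. For reference, a roughly equivalent configuration with the `peft` library would look like the sketch below; the target modules and dropout are assumptions, not values taken from the released training code.

```python
from peft import LoraConfig

# r and alpha mirror the table above; everything else is illustrative.
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections in the LLM
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```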
### Checkpoint Result

| Dataset | Type | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
|---|---|---|---|---|---|
| librispeech-test-clean | Read Speech | 6.73 | 0.9496 | - | - |
| librispeech-test-other | Read Speech | 9.13 | 0.9217 | - | - |
| CommonVoice test | Diverse Accent, Age | 25.66 | 0.8680 | 0.6041 | 0.6959 |
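Word Error Rate in the table compares the predicted `Transcript` field against the reference transcription. To evaluate the model on your own data, a standard WER implementation such as the `jiwer` package can be used; the snippet below is a minimal sketch with made-up strings.

```python
from jiwer import wer

reference = "yes i got it i will make the payment now"
hypothesis = "yes i got it i'll make the payment now"  # e.g. the model's Transcript field, normalized

print(f"WER: {wer(reference, hypothesis):.2%}")
```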
## Cite

```bibtex
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url = {https://github.com/skit-ai/SpeechLLM}
}
```
## License

This project is licensed under the Apache 2.0 License.