# SpeechLLM
SpeechLLM is a multi-modal large language model (LLM) designed to predict the metadata of a speaker's turn in a conversation. It combines a HubertX audio encoder with a TinyLlama LLM, enabling it to analyze audio signals and extract valuable information about the speaker.
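At a high level this follows the usual speech-LLM recipe: an audio encoder turns the waveform into frame-level embeddings, a small projector maps those into the LLM's embedding space, and the LLM decodes the requested metadata conditioned on both the audio embeddings and the text instruction. The sketch below only illustrates that fusion pattern; the dimensions, projector shape, and fusion details are assumptions for illustration, not the released architecture.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the real encoder and LLM define their own.
AUDIO_DIM, LLM_DIM = 1024, 2048

class AudioToLLMProjector(nn.Module):
    """Maps audio-encoder frame embeddings into the LLM embedding space."""
    def __init__(self, audio_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_frames)

# (batch, num_frames, AUDIO_DIM) frame embeddings from the audio encoder
audio_frames = torch.randn(1, 50, AUDIO_DIM)
# (batch, num_tokens, LLM_DIM) embeddings of the text instruction
text_embeds = torch.randn(1, 20, LLM_DIM)

projector = AudioToLLMProjector(AUDIO_DIM, LLM_DIM)
# Prepend the projected audio embeddings to the instruction embeddings and
# let the LLM generate the metadata from the combined sequence.
llm_inputs = torch.cat([projector(audio_frames), text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([1, 70, 2048])
```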
## Quick Start

## Links

## Model Image

## Model Capabilities
SpeechLLM can predict the following metadata from audio:
- SpeechActivity: Whether the audio signal contains speech (True/False)
- Transcript: The automatic speech recognition (ASR) transcript of the audio
- Gender of the speaker (Female/Male)
- Age of the speaker (Young/Middle-Age/Senior)
- Accent of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
- Emotion of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
## Features
- Multi-modal: Combines audio encoding and language modeling for comprehensive speech analysis.
- Versatile Predictions: Provides a wide range of metadata about the speaker.
- Easy to Use: Can be loaded directly from Hugging Face and used with simple instructions.
## Installation

No separate installation step is required; the model loads directly through the Hugging Face `transformers` library, with `torchaudio` used for reading audio in the usage example below (e.g. `pip install transformers torchaudio`).
## Usage Examples

### Basic Usage
```python
import torchaudio
from transformers import AutoModel

# Load the model directly from Hugging Face
model = AutoModel.from_pretrained("skit-ai/speechllm-2B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",
    # torchaudio.load returns (waveform, sample_rate); index 0 is the waveform tensor
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)
```
Example output:

```
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America",
}
```
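The output above is a JSON-like string (note the trailing comma, which strict JSON parsers reject). If you want structured access to the individual fields, a small helper along these lines can convert it into a dict; the helper name and the assumption that `generate_meta` returns a plain string are illustrative, not part of the documented API.

```python
import json
import re

def parse_meta(raw: str) -> dict:
    """Parse the JSON-like metadata string into a dict.

    Removes a trailing comma before the closing brace, which the model's
    output may contain but strict JSON does not allow.
    """
    cleaned = re.sub(r",\s*}", "}", raw.strip())
    return json.loads(cleaned)

meta = parse_meta('{"SpeechActivity": "True", "Transcript": "Yes, I got it.",}')
print(meta["Transcript"])  # Yes, I got it.
```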
## Try it Out

You can try the model in the Google Colab notebook. Also, check out our blog on SpeechLLM for end-to-end conversational agents (User Speech -> Response).
## Documentation

### Model Details

| Property | Details |
|---|---|
| Developed by | Skit AI |
| Authors | Shangeth Rajaa, Abhinav Tushar |
| Language | English |
| Finetuned from model | HubertX, TinyLlama |
| Model Size | 2.1 B |
| Checkpoint | 2000 k steps (bs=1) |
| Adapters | r=4, alpha=8 |
| lr | 1e-4 |
| gradient accumulation steps | 8 |
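The adapter hyperparameters above (r=4, alpha=8) are LoRA-style settings. For reference, a roughly equivalent configuration with the `peft` library would look like the sketch below; the target modules and dropout are assumptions, not values taken from the released training code.

```python
from peft import LoraConfig

# r and alpha mirror the table above; everything else is illustrative.
lora_config = LoraConfig(
    r=4,
    lora_alpha=8,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections in the LLM
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```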
### Checkpoint Result

| Dataset | Type | Word Error Rate | Gender Acc | Age Acc | Accent Acc |
|---|---|---|---|---|---|
| librispeech-test-clean | Read Speech | 6.73 | 0.9496 | - | - |
| librispeech-test-other | Read Speech | 9.13 | 0.9217 | - | - |
| CommonVoice test | Diverse Accent, Age | 25.66 | 0.8680 | 0.6041 | 0.6959 |
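Word Error Rate in the table compares the predicted `Transcript` field against the reference transcription. To evaluate the model on your own data, a standard WER implementation such as the `jiwer` package can be used; the snippet below is a minimal sketch with made-up strings.

```python
from jiwer import wer

reference = "yes i got it i will make the payment now"
hypothesis = "yes i got it i'll make the payment now"  # e.g. the model's Transcript field, normalized

print(f"WER: {wer(reference, hypothesis):.2%}")
```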
## Cite

```bibtex
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url = {https://github.com/skit-ai/SpeechLLM}
}
```
## License

This project is licensed under the Apache 2.0 License.