SpeechLLM 1.5B
SpeechLLM is a multimodal large language model designed to predict speaker turn metadata in conversations, including speech activity, transcribed text, gender, age, accent, and emotion.
Downloads: 40
Release date: June 20, 2024
Model Overview
SpeechLLM combines the HubertX audio encoder with the TinyLlama LLM, processing speech signals to generate rich metadata about each speaker turn.
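A minimal sketch of querying the model follows. It assumes the checkpoint is published on Hugging Face (the repo id shown is an assumption) and loaded through the `transformers` library with `trust_remote_code=True`; the `generate_meta` call and the instruction format are likewise assumptions to be verified against the actual model card, so the inference step that needs the 1.5B checkpoint is left as a comment.

```python
# Sketch of requesting speaker-turn metadata from SpeechLLM.
# ASSUMPTIONS: the checkpoint id and the generate_meta() signature
# shown in the comments are taken on faith and should be checked
# against the published model card before use.

def build_instruction(fields):
    """Build a metadata-request instruction listing the desired fields,
    e.g. transcript, gender, age, accent, and emotion."""
    return ("Give me the following information about the audio "
            "[" + ", ".join(fields) + "]")

instruction = build_instruction(
    ["SpeechActivity", "Transcript", "Gender", "Age", "Accent", "Emotion"])
print(instruction)

# Loading and inference (requires the transformers library and the
# ~1.5B-parameter checkpoint, so it is only sketched here):
#
# from transformers import AutoModel
# model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B",  # assumed repo id
#                                   trust_remote_code=True)
# output = model.generate_meta(
#     audio_path="conversation_turn.wav",  # hypothetical input file
#     instruction=instruction,
#     max_new_tokens=500,
# )
# print(output)  # metadata for the speaker turn
```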
Model Features
Multimodal Processing Capability
Combines audio signal processing with language model capabilities to understand speech content and generate metadata.
Rich Metadata Prediction
Can predict various information such as speech activity, transcribed text, speaker gender, age, accent, and emotion.
Diverse Dataset Training
Trained on multiple speech datasets including Common Voice and LibriSpeech, enhancing the model's generalization ability.
Model Capabilities
Speech Activity Detection
Automatic Speech Recognition
Speaker Gender Classification
Speaker Age Classification
Speaker Accent Classification
Emotion Recognition
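The model returns its predictions as text. Assuming a JSON-style reply keyed by the capability names above (the exact output format is an assumption and should be checked against the model card), a small parser can turn each reply into a typed record:

```python
import json

# Parse a SpeechLLM metadata reply into a Python dict.
# ASSUMPTION: the model answers with a JSON object keyed by the
# capability names listed above; adjust the keys to the real output.

EXPECTED_KEYS = {"SpeechActivity", "Transcript", "Gender",
                 "Age", "Accent", "Emotion"}

def parse_metadata(reply: str) -> dict:
    """Decode a JSON reply, keeping only known keys and
    normalising SpeechActivity to a boolean."""
    data = json.loads(reply)
    meta = {k: v for k, v in data.items() if k in EXPECTED_KEYS}
    if "SpeechActivity" in meta:
        meta["SpeechActivity"] = str(meta["SpeechActivity"]).lower() == "true"
    return meta

# Hypothetical reply used purely for illustration:
reply = ('{"SpeechActivity": "True", "Transcript": "how can I help you", '
         '"Gender": "Female", "Age": "Young Adult", '
         '"Accent": "United States", "Emotion": "Neutral"}')
meta = parse_metadata(reply)
print(meta["Transcript"], meta["SpeechActivity"])
```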
Use Cases
Speech Analysis
Customer Service Dialogue Analysis
Analyze speaker characteristics and emotional states in customer service conversations.
Identifies customer emotions and demographic information to help improve service quality.
Enhanced Speech Transcription
Add speaker metadata to speech transcriptions.
Provides richer transcription text information, including speaker characteristics.
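One way to realise this enrichment is to prefix each transcript line with the predicted speaker attributes. The field names and rendering below are a sketch with hypothetical labels, not the model's own output format:

```python
def annotate_transcript(meta: dict) -> str:
    """Render one speaker turn as '[Gender, Age, Emotion] transcript'.
    Field names are hypothetical and mirror the metadata listed above."""
    tags = [meta.get(k, "unknown") for k in ("Gender", "Age", "Emotion")]
    return "[{}] {}".format(", ".join(tags), meta.get("Transcript", ""))

# Illustrative speaker turn:
turn = {"Transcript": "my order never arrived",
        "Gender": "Male", "Age": "Middle Age", "Emotion": "Angry"}
print(annotate_transcript(turn))
# -> [Male, Middle Age, Angry] my order never arrived
```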
Conversational Systems
Intelligent Voice Assistant
Build conversational agents capable of understanding speaker characteristics.
Delivers personalized responses based on speaker features.
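As a sketch of how predicted attributes could drive personalisation, a simple rule table can map the emotion label to a response style. The emotion labels and styles here are illustrative assumptions, not part of SpeechLLM itself:

```python
# Map a predicted emotion label to a response style for the assistant.
# Labels and styles are illustrative assumptions only.
STYLE_BY_EMOTION = {
    "Angry":   "apologetic, de-escalating",
    "Sad":     "empathetic, reassuring",
    "Happy":   "upbeat, concise",
    "Neutral": "neutral, informative",
}

def pick_response_style(meta: dict) -> str:
    """Choose a response style from the predicted Emotion field,
    falling back to a neutral style for unknown labels."""
    return STYLE_BY_EMOTION.get(meta.get("Emotion", ""),
                                "neutral, informative")

print(pick_response_style({"Emotion": "Angry"}))  # apologetic, de-escalating
```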