Pathumma-llm-audio-1.0.0 Open Source Thai Large Language Model - Free Deployment for Processing Multiple Audio Understanding Tasks

Developed by NECTEC

Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model specifically designed for audio comprehension tasks, capable of processing various audio inputs including speech, general audio, and music.

Audio-to-Text

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

#Thai audio comprehension #Multimodal audio processing #Short audio transcription

Downloads 333

Release Time: 10/24/2024

Model Overview

This model combines the OpenThaiLLM-DoodNiLT-V1.0.0-Beta-7B language model with the Pathumma-whisper-th-large-v3 speech encoder to convert audio into meaningful text representations.

Model Features

Multi-type audio processing

Capable of processing various types of audio inputs including speech, general audio, and music.

Thai language optimization

Specially designed for Thai, with optimized capabilities for Thai speech and text conversion.

Efficient inference

Supports LoRA inference mode, suitable for operation with limited resources.

Model Capabilities

Audio transcription

Speech comprehension

Text generation

Use Cases

Speech transcription

Thai speech-to-text

Convert Thai speech into text output.

Audio comprehension

General audio analysis

Analyze general audio content and generate descriptive text.

🚀 Pathumma-Audio

Pathumma-Audio is an 8-billion-parameter Thai large language model tailored for audio understanding tasks. It can handle various audio inputs, such as speech, general audio, and music, and transform them into meaningful text.

🚀 Quick Start

To load the model and generate responses using the Hugging Face Transformers library, follow these steps:

1. Install the required dependencies:

Ensure you have the necessary libraries installed by running:

pip install librosa torch torchaudio transformers peft
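
Before loading the model, it can help to confirm that the packages from the command above are importable in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the model's own code.

```python
import importlib.util

# Import names for the packages installed by the pip command above.
REQUIRED = ["librosa", "torch", "torchaudio", "transformers", "peft"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

This only checks that each package can be located; it does not verify versions or GPU support.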

2. Load the model and generate a response:

You can load the model and use it to generate a response with the following code snippet:

import torch
import librosa
from transformers import AutoModel

# Use the GPU with bfloat16 when available; otherwise fall back to CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32

# trust_remote_code=True runs the custom model code shipped in the model repository.
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch_dtype,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True,
)
model = model.to(device)

# Thai prompt meaning "Transcribe the audio into text".
prompt = "ถอดเสียงเป็นข้อความ"
audio_path = "audio_path.wav"
# Load the file as mono audio resampled to 16 kHz.
audio, sr = librosa.load(audio_path, sr=16000)

model.eval()
with torch.no_grad():
    response = model.generate(
        raw_wave=audio,
        prompts=prompt,
        device=device,
        max_new_tokens=200,
        repetition_penalty=1.0,
    )
print(response[0])
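
Because the model is prompted in natural language, the same clip can be sent through several prompts in turn. The sketch below assumes only the `model.generate` keyword arguments shown in the snippet above and that the first element of the result is the text response; the helper name `run_prompts` is hypothetical, and the only prompt documented here is the transcription prompt, so any other prompt wording is an assumption to verify.

```python
def run_prompts(model, audio, prompts, device, max_new_tokens=200):
    """Run each prompt against the same audio clip and collect the first response."""
    results = {}
    for prompt in prompts:
        # Same keyword arguments as the quick-start call to model.generate.
        response = model.generate(
            raw_wave=audio,
            prompts=prompt,
            device=device,
            max_new_tokens=max_new_tokens,
            repetition_penalty=1.0,
        )
        results[prompt] = response[0]
    return results
```

For inference, call it after `model.eval()` and inside a `with torch.no_grad():` block, for example `run_prompts(model, audio, ["ถอดเสียงเป็นข้อความ"], device)`.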

✨ Features

Multimodal Audio Processing: The model can process multiple types of audio inputs, including speech, general audio, and music, and convert them into meaningful textual representations.

📚 Documentation

Model Description

Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model designed for audio understanding tasks. The model can process multiple types of audio inputs, including speech, general audio, and music, and convert them into meaningful textual representations.

Model Architecture

The model combines two key components:

1. Base Language Model: OpenThaiLLM-DoodNiLT-V1.0.0-Beta-7B (Qwen2)
2. Base Speech Encoder: Pathumma-whisper-th-large-v3 (Whisper)

Limitations and Future Work

Our model is still in the experimental research phase and is not yet fully suitable for practical use as an assistant. It accepts audio inputs of up to 30 seconds, which limits its usefulness for longer audio tasks. Future work will focus on upgrading the language model to a newer version, Pathumma-llm-text-1.0.0, and on curating more refined and robust datasets to improve performance. We also aim to address and prioritize the safety and reliability of the model's outputs.
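
One possible workaround for the 30-second input limit is to split a longer recording into consecutive pieces and process each piece separately. The sketch below is an assumption, not a strategy described by the authors; the helper name `split_into_chunks` is hypothetical, and cutting at fixed boundaries can split a word in two, so results near chunk edges may be unreliable.

```python
def split_into_chunks(samples, sample_rate=16000, max_seconds=30.0):
    """Split a 1-D sequence of samples into consecutive pieces of at most max_seconds."""
    if sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("sample_rate and max_seconds must be positive")
    chunk_len = int(sample_rate * max_seconds)
    if chunk_len == 0:
        raise ValueError("max_seconds is too short for this sample_rate")
    # Slicing works for Python lists and for NumPy arrays such as those returned by librosa.load.
    return [samples[i:i + chunk_len] for i in range(0, len(samples), chunk_len)]
```

Each returned piece can then be passed as `raw_wave` in its own generation call, and the text outputs joined in order.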

Acknowledgements

We are grateful to ThaiSC, also known as the NSTDA Supercomputer Centre, for providing the LANTA supercomputer, which was used for model training and fine-tuning. We would also like to thank the SALMONN team for making their code publicly available, and Typhoon Audio at SCB 10X for making available their Hugging Face project, source code, and technical paper, which served as a valuable guide for us. Many other open-source projects have contributed valuable information, code, data, and model weights; we are grateful to them all.

Pathumma Audio Team

Pattara Tipaksorn, Wayupuk Sommuang, Oatsada Chatthong, Kwanchiva Thangthai

Citation

@misc{tipaksorn2024PathummaAudio,
    title        = { {Pathumma-Audio} },
    author       = { Pattara Tipaksorn and Wayupuk Sommuang and Kwanchiva Thangthai },
    url          = { https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0 },
    publisher    = { Hugging Face },
    year         = { 2024 },
}

📄 License

This project is licensed under the Apache-2.0 license.
