🚀 Pathumma-Audio
Pathumma-Audio is an 8-billion-parameter Thai large language model tailored for audio understanding tasks. It can handle various audio inputs, such as speech, general audio, and music, and transform them into meaningful text.
🚀 Quick Start
To load the model and generate responses using the Hugging Face Transformers library, follow these steps:
1. Install the required dependencies:
pip install librosa torch torchaudio transformers peft
2. Load the model and generate a response:
import torch
import librosa
from transformers import AutoModel
# Use the GPU with bfloat16 when available; otherwise fall back to CPU and float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
# trust_remote_code=True is required because the model class ships with the repository.
model = AutoModel.from_pretrained(
    "nectec/Pathumma-llm-audio-1.0.0",
    torch_dtype=torch_dtype,
    lora_infer_mode=True,
    init_from_scratch=True,
    trust_remote_code=True
)
model = model.to(device)
# Thai prompt meaning "Transcribe the audio into text".
prompt = "ถอดเสียงเป็นข้อความ"
# Replace with the path to your own audio file.
audio_path = "audio_path.wav"
# Load the audio and resample it to 16 kHz.
audio, sr = librosa.load(audio_path, sr=16000)
model.eval()
with torch.no_grad():
    response = model.generate(
        raw_wave=audio,
        prompts=prompt,
        device=device,
        max_new_tokens=200,
        repetition_penalty=1.0,
    )
print(response[0])
✨ Features
- Multimodal Audio Processing: The model can process multiple types of audio inputs, including speech, general audio, and music, and convert them into meaningful textual representations.
📚 Documentation
Model Description
Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model designed for audio understanding tasks. The model can process multiple types of audio inputs, including speech, general audio, and music, and convert them into meaningful textual representations.
Model Architecture
The model combines two key components:
Limitations and Future Work
The model is still in the experimental research phase and is not yet fully suitable for practical use as an assistant. It accepts audio inputs of up to 30 seconds, which limits its usefulness for longer recordings. Future work will focus on upgrading the language model to a newer version, Pathumma-llm-text-1.0.0, and on curating more refined and robust datasets to improve performance. We also aim to prioritize the safety and reliability of the model's outputs.
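One possible workaround for the 30-second input limit is to split a longer recording into consecutive windows and process each one separately. The sketch below is only an illustration and is not part of the released model code: the helper name `split_into_chunks` and its parameters are hypothetical, and it assumes a mono waveform sampled at 16 kHz, as in the Quick Start snippet. Cutting at fixed boundaries can split words in half, so output near chunk edges may be less reliable.

```python
def split_into_chunks(wave, sr=16000, max_seconds=30.0):
    """Split a mono waveform into consecutive chunks of at most max_seconds.

    Hypothetical helper: works on any sliceable sequence, such as the
    NumPy array returned by librosa.load or a plain Python list.
    """
    chunk_len = int(sr * max_seconds)  # samples per chunk
    if chunk_len <= 0:
        raise ValueError("sr * max_seconds must be positive")
    # The last chunk may be shorter than max_seconds.
    return [wave[i:i + chunk_len] for i in range(0, len(wave), chunk_len)]


# Example: 70 seconds of silence at 16 kHz.
demo = [0.0] * (70 * 16000)
chunks = split_into_chunks(demo)
# Chunk durations in seconds: 30, 30 and 10.
print([len(c) / 16000 for c in chunks])
```

Each chunk could then be passed as `raw_wave` to `model.generate`, as in the Quick Start, and the per-chunk responses joined afterwards. Whether this gives acceptable quality on long recordings has not been verified here.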
Acknowledgements
We are grateful to ThaiSC, also known as the NSTDA Supercomputer Centre, for providing the LANTA supercomputer used for model training and fine-tuning. We also thank the SALMONN team for making their code publicly available, and Typhoon Audio at SCB 10X for releasing their Hugging Face project, source code, and technical paper, which served as a valuable guide for us. Many other open-source projects have contributed valuable information, code, data, and model weights; we are grateful to them all.
Pathumma Audio Team
Pattara Tipaksorn, Wayupuk Sommuang, Oatsada Chatthong, Kwanchiva Thangthai
Citation
@misc{tipaksorn2024PathummaAudio,
title = { {Pathumma-Audio} },
author = { Pattara Tipaksorn and Wayupuk Sommuang and Kwanchiva Thangthai },
url = { https://huggingface.co/nectec/Pathumma-llm-audio-1.0.0 },
publisher = { Hugging Face },
year = { 2024 },
}
📄 License
This project is licensed under the Apache-2.0 license.