🚀 Typhoon-Audio Preview
Typhoon-Audio Preview is a Thai audio-language model that accepts text and audio as input and produces text as output. It is a research preview released as part of our multimodal effort.
🚀 Quick Start
llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It natively supports both text and audio input modalities, while the output is text. This version (August 2024) is our first audio-language model, released as a research preview as part of our multimodal effort. The base language model is our llama-3-typhoon-v1.5-8b-instruct.
More details can be found in our technical report. To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.
✨ Features
- Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- Requirement: transformers 4.38.0 or newer (see the version check after this list).
- Primary Language(s): Thai 🇹🇭 and English 🇺🇸
- Demo: https://audio.opentyphoon.ai/
- License: Llama 3 Community License
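Before loading the model, it can help to confirm the transformers version programmatically. A minimal sketch, assuming the `packaging` module is available (it ships with pip); this is illustrative, not part of the official setup:

```python
# Illustrative guard for the "transformers 4.38.0 or newer" requirement above.
from packaging import version
import transformers

assert version.parse(transformers.__version__) >= version.parse("4.38.0"), \
    f"transformers {transformers.__version__} is too old; 4.38.0+ is required"
```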
💻 Usage Examples
Basic Usage
```python
import torch
import soundfile as sf
import librosa
from transformers import AutoModel

# Load the model; trust_remote_code is required because generation is
# implemented in the repository's custom modeling code.
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.to("cuda")
model.eval()

# Prepare the audio for the encoder: mono, at most 30 seconds, 16 kHz.
audio, sr = sf.read("path_to_your_audio.wav")
if len(audio.shape) == 2:  # stereo -> mono (keep the first channel)
    audio = audio[:, 0]
if len(audio) > 30 * sr:  # truncate to 30 seconds
    audio = audio[: 30 * sr]
if sr != 16000:  # resample to the 16 kHz rate the encoder expects
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000, res_type="fft")

# Llama 3 chat template with a placeholder span for the audio.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

response = model.generate(
    audio=audio,
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_new_tokens=512,
    repetition_penalty=1.1,
    num_beams=1,
)
print(response)
```
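A note on `prompt_pattern`: it follows the Llama 3 chat template, and the `<Speech><SpeechHere></Speech>` span appears to mark where the encoded audio is spliced in (a SALMONN-style convention), with `{}` receiving the text prompt. When customizing instructions, keep that span intact.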
Advanced Usage
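The same `generate` wrapper can be repurposed for other instructions. A hedged sketch (not an officially documented recipe): changing the instruction to speech translation and switching from greedy decoding to beam search, assuming only the parameters demonstrated in Basic Usage are supported by the preview API.

```python
# Same API as Basic Usage, different instruction and decoding settings.
# Assumption: only the kwargs shown in Basic Usage are guaranteed to be supported.
response = model.generate(
    audio=audio,  # the 16 kHz mono waveform prepared above
    prompt="Transcribe the audio, then translate the transcript into English.",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_new_tokens=512,
    repetition_penalty=1.1,
    num_beams=4,  # beam search instead of greedy decoding
)
print(response)
```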
📚 Documentation
More information is provided in our technical report.
| Model | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|---|---|---|---|---|---|
| SALMONN-13B | 5.79 | 98.07 | 0.07 | 0.10 | 14.97 |
| DiVA-8B | 30.28 | 65.21 | 9.82 | 5.31 | 7.97 |
| Gemini-1.5-pro-001 | 5.98 | 13.56 | 20.69 | 13.52 | 22.54 |
| Typhoon-Audio-Preview | 8.72 | 14.17 | 17.52 | 10.67 | 24.14 |
| Model | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|---|---|---|---|
| SALMONN-13B | 93.26 | 2.95 | 1.18 |
| DiVA-8B | 50.12 | 15.13 | 2.68 |
| Gemini-1.5-pro-001 | 81.32 | 62.10 | 3.93 |
| Typhoon-Audio-Preview | 93.74 | 64.60 | 6.11 |
⚠️ Important Note
This model is experimental and may not always follow human instructions accurately, making it prone to generating hallucinations. Additionally, the model lacks moderation mechanisms and may produce harmful or inappropriate responses. Developers should carefully assess potential risks based on their specific applications.
📄 License
The model is released under the Llama 3 Community License.
Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/us5gAYmrxw
Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing their fine-tuned Whisper, whose encoder we adopted. We also thank the many other open-source projects for sharing useful knowledge, data, code, and model weights.
Typhoon Team
Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun,
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul