🚀 Typhoon-Audio Preview
Typhoon-Audio Preview is a Thai audio-language model that accepts text and audio as input and produces text as output. It is a research preview released as part of our multimodal effort.
🚀 Quick Start
llama-3-typhoon-v1.5-8b-audio-preview is a 🇹🇭 Thai audio-language model. It natively supports both text and audio input modalities, while the output is text. This version (August 2024) is our first audio-language model, released as a research preview as part of our multimodal effort. The base language model is our llama-3-typhoon-v1.5-8b-instruct.
More details can be found in our technical report. To acknowledge Meta's effort in creating the foundation model and to comply with the license, we explicitly include "llama-3" in the model name.
✨ Features
- Model type: The LLM is based on Typhoon-1.5-8b-instruct, and the audio encoder is based on Whisper's encoder and BEATs.
- Requirement: transformers 4.38.0 or newer (see the version check after this list).
- Primary Language(s): Thai 🇹🇭 and English 🇺🇸
- Demo: https://audio.opentyphoon.ai/
- License: Llama 3 Community License
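Before loading the model, it can help to confirm the transformers version programmatically. A minimal sketch, assuming the `packaging` module is available (it ships with pip); this is illustrative, not part of the official setup:

```python
# Illustrative guard for the "transformers 4.38.0 or newer" requirement above.
from packaging import version
import transformers

assert version.parse(transformers.__version__) >= version.parse("4.38.0"), \
    f"transformers {transformers.__version__} is too old; 4.38.0+ is required"
```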
💻 Usage Examples
Basic Usage
```python
import torch
import soundfile as sf
import librosa
from transformers import AutoModel

# Load the model; trust_remote_code is required because generation is
# implemented in the repository's custom modeling code.
model = AutoModel.from_pretrained(
    "scb10x/llama-3-typhoon-v1.5-8b-audio-preview",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)
model.to("cuda")
model.eval()

# Prepare the audio for the encoder: mono, at most 30 seconds, 16 kHz.
audio, sr = sf.read("path_to_your_audio.wav")
if len(audio.shape) == 2:  # stereo -> mono (keep the first channel)
    audio = audio[:, 0]
if len(audio) > 30 * sr:  # truncate to 30 seconds
    audio = audio[: 30 * sr]
if sr != 16000:  # resample to the 16 kHz rate the encoder expects
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000, res_type="fft")

# Llama 3 chat template with a placeholder span for the audio.
prompt_pattern = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n<Speech><SpeechHere></Speech> {}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

response = model.generate(
    audio=audio,
    prompt="transcribe this audio",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_new_tokens=512,
    repetition_penalty=1.1,
    num_beams=1,
)
print(response)
```
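A note on `prompt_pattern`: it follows the Llama 3 chat template, and the `<Speech><SpeechHere></Speech>` span appears to mark where the encoded audio is spliced in (a SALMONN-style convention), with `{}` receiving the text prompt. When customizing instructions, keep that span intact.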
Advanced Usage
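The same `generate` wrapper can be repurposed for other instructions. A hedged sketch (not an officially documented recipe): changing the instruction to speech translation and switching from greedy decoding to beam search, assuming only the parameters demonstrated in Basic Usage are supported by the preview API.

```python
# Same API as Basic Usage, different instruction and decoding settings.
# Assumption: only the kwargs shown in Basic Usage are guaranteed to be supported.
response = model.generate(
    audio=audio,  # the 16 kHz mono waveform prepared above
    prompt="Transcribe the audio, then translate the transcript into English.",
    prompt_pattern=prompt_pattern,
    do_sample=False,
    max_new_tokens=512,
    repetition_penalty=1.1,
    num_beams=4,  # beam search instead of greedy decoding
)
print(response)
```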
📚 Documentation
More information is provided in our technical report.
| Model | ASR-en (WER↓) | ASR-th (WER↓) | En2Th (BLEU↑) | X2Th (BLEU↑) | Th2En (BLEU↑) |
|---|---|---|---|---|---|
| SALMONN-13B | 5.79 | 98.07 | 0.07 | 0.10 | 14.97 |
| DiVA-8B | 30.28 | 65.21 | 9.82 | 5.31 | 7.97 |
| Gemini-1.5-pro-001 | 5.98 | 13.56 | 20.69 | 13.52 | 22.54 |
| Typhoon-Audio-Preview | 8.72 | 14.17 | 17.52 | 10.67 | 24.14 |
| Model | Gender-th (Acc) | SpokenQA-th (F1) | SpeechInstruct-th |
|---|---|---|---|
| SALMONN-13B | 93.26 | 2.95 | 1.18 |
| DiVA-8B | 50.12 | 15.13 | 2.68 |
| Gemini-1.5-pro-001 | 81.32 | 62.10 | 3.93 |
| Typhoon-Audio-Preview | 93.74 | 64.60 | 6.11 |
⚠️ Important Note
This model is experimental and may not always follow human instructions accurately, making it prone to generating hallucinations. Additionally, the model lacks moderation mechanisms and may produce harmful or inappropriate responses. Developers should carefully assess potential risks based on their specific applications.
📄 License
The model is released under the Llama 3 Community License.
Follow us & Support
- https://twitter.com/opentyphoon
- https://discord.gg/us5gAYmrxw
Acknowledgements
We would like to thank the SALMONN team for open-sourcing their code and data, and the Biomedical and Data Lab at Mahidol University for releasing their fine-tuned Whisper, whose encoder we adopted. We also thank the many other open-source projects for sharing useful knowledge, data, code, and model weights.
Typhoon Team
Potsawee Manakul, Sittipong Sripaisarnmongkol, Natapong Nitarach, Warit Sirichotedumrong, Adisai Na-Thalang, Phatrasek Jirabovonvisut, Parinthapat Pengpun,
Krisanapong Jirayoot, Pathomporn Chokchainant, Kasima Tharnpipitchai, Kunat Pipatanakul