Llama3.1 Typhoon2 Audio 8b Instruct

Developed by scb10x

Typhoon 2-Audio Edition is an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs while simultaneously generating both text and speech outputs. The model is specifically optimized for Thai language while also supporting English.

Text-to-Audio

Transformers

Supports Multiple Languages

#Thai speech processing #End-to-end speech model #Multimodal dialogue

Release Time: 12/13/2024

Model Overview

A speech-to-speech model based on the Typhoon 2 large language model, supporting Thai and English speech input and output with text generation and speech synthesis capabilities.

Model Features

Multimodal input/output

Supports audio, speech, and text inputs while simultaneously generating both text and speech outputs

Thai language optimization

Specifically optimized for Thai language, providing high-quality Thai speech processing capabilities

End-to-end architecture

Complete speech-to-speech processing pipeline without requiring additional intermediate steps

Multi-turn dialogue support

Supports complex multi-turn dialogue interactions while maintaining contextual consistency

Model Capabilities

Speech recognition

Speech synthesis

Text generation

Speech-to-speech

Multilingual processing

Dialogue system

Use Cases

Voice assistant

Thai voice assistant

Building Thai voice assistants supporting voice input and output

Scored 7.19/10 for response quality on SpeechIF (Thai) in the speech-to-speech content evaluation

Speech transcription

Thai speech transcription

Transcribing Thai speech content into text

14.04% WER for Thai ASR

Speech translation

English-Thai speech translation

Translating English speech to Thai text or speech

27.15 BLEU score for English-to-Thai translation

library_name: transformers
license: llama3.1
language: th, en
pipeline_tag: text-generation

Typhoon2-Audio

Typhoon2-Audio is an end-to-end speech-to-speech model architecture capable of processing audio, speech, and text inputs and generating both text and speech outputs simultaneously. It is optimized specifically for Thai, but it also supports English.

GitHub: https://github.com/scb-10x/typhoon2-audio/
Demo: https://audio.opentyphoon.ai/
Paper: https://arxiv.org/abs/2412.13702

Model Description

Model type: The LLM is based on Typhoon2 LLM.
Requirement: Python==3.10 & transformers==4.52.2 & fairseq==0.12.2 & flash-attn
Primary Language(s): Thai 🇹🇭 and English 🇬🇧
License-Speech-Input & LLM: Llama 3.1 Community License
License-Speech-Output: CC-BY-NC

Installation

pip install pip==24.0
pip install transformers==4.45.2
pip install fairseq==0.12.2 # fairseq requires pip==24.0 to install and only works on Python 3.10
pip install flash-attn
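
Because the setup depends on exact versions, it can help to confirm the environment before loading the model. The following is a minimal sketch, not part of the official instructions: `check_environment` and `PINNED` are hypothetical names, and the pins are copied from the pip commands above. The Requirement line lists a different transformers version, so adjust the pin to whichever one you actually install.

```python
import sys
from importlib import metadata

# Pins copied from the pip commands above (assumption: edit if you install other versions).
PINNED = {"transformers": "4.45.2", "fairseq": "0.12.2"}


def check_environment(pinned=PINNED, python_version=sys.version_info[:2]):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if tuple(python_version) != (3, 10):
        problems.append(
            f"Python 3.10 expected, found {python_version[0]}.{python_version[1]}"
        )
    for package, wanted in pinned.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed")
            continue
        if found != wanted:
            problems.append(f"{package}=={wanted} expected, found {found}")
    return problems
```

Calling `check_environment()` after installation and printing the returned list gives a quick summary of anything that differs from the pins.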

Usage

Load Model

import torch
from transformers import AutoModel
model = AutoModel.from_pretrained(
    "scb10x/llama3.1-typhoon2-audio-8b-instruct",
    torch_dtype=torch.float16, 
    trust_remote_code=True
)
model.to("cuda")

Inference - Single turn example

conversation = [
    {"role": "system", "content": "You are a helpful female assistant named ไต้ฝุ่น."},
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
            },
            {"type": "text", "text": "Transcribe this audio"},
        ],
    },
]
x = model.generate(
    conversation=conversation,
    max_new_tokens=500,
    do_sample=True,
    num_beams=1,
    top_p=0.9,
    repetition_penalty=1.0,
    length_penalty=1.0,
    temperature=0.7,
)
# x => x['text'] (text), x['audio'] (dict with 'array' and 'sampling_rate')
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])
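
The conversation format above is a list of role/content dictionaries, where user content is a list of typed parts. As a convenience, the structure can be assembled with small helpers. This is an illustrative sketch only: `user_turn` and `single_turn` are hypothetical names, not functions provided by the model's code.

```python
def user_turn(audio_path=None, text=None):
    """Build one user message with an optional audio part and an optional text part."""
    content = []
    if audio_path is not None:
        content.append({"type": "audio", "audio_url": audio_path})
    if text is not None:
        content.append({"type": "text", "text": text})
    if not content:
        raise ValueError("a user turn needs audio, text, or both")
    return {"role": "user", "content": content}


def single_turn(system_prompt, audio_path=None, text=None):
    """Build a system message followed by a single user message."""
    return [{"role": "system", "content": system_prompt}, user_turn(audio_path, text)]


# Rebuilds the same conversation as the single-turn example above.
conversation = single_turn(
    "You are a helpful female assistant named ไต้ฝุ่น.",
    audio_path="examples/tmp-2860cd0a094b64043226167340af03a3.wav",
    text="Transcribe this audio",
)
```

The resulting `conversation` list can be passed to `model.generate(conversation=conversation, ...)` exactly as in the example.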

Inference - Multi turn example

conversation_multi_turn = [
    {
        "role": "system",
        "content": "You are a helpful female assistant named ไต้ฝุ่น. Respond conversationally to the speech provided in the language it is spoken in.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2860cd0a094b64043226167340af03a3.wav",
                # บอกชื่อเมืองใหญ่ๆในอเมริกามาให้หน่อยสิ -- "List some names of US cities"
            },
            {
                "type": "text",
                "text": "",
            },
        ],
    },
    {
        "role": "assistant",
        "content": [
            {
                "type": "text",
                "text": "โอเคค่ะ, ฉันจะบอกชื่อเมืองใหญ่ๆ ในอเมริกาให้คุณฟัง:\n\n1. นิวยอร์ก\n2. ลอสแอนเจลิส\n3. ชิคาโก\n4. ฮิวสตัน\n5. ฟิลาเดลเฟีย\n6. บอสตัน\n7. ซานฟรานซิสโก\n8. วอชิงตัน ดี.ซี. (Washington D.C.)\n9. แอตแลนต้า\n10. ซีแอตเทิล\n\nถ้าคุณต้องการข้อมูลเพิ่มเติมหรือมีคำถามอื่นๆ กรุณาถามได้เลยค่ะ'",
            },
        ],
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "audio_url": "examples/tmp-2284cd76e1c875525ff75327a2fc3610.wav",
                # แล้วถ้าเป็นประเทศอังกฤษล่ะ -- "How about the UK"

            },
        ],
    },
]
x = model.generate(conversation=conversation_multi_turn)
# x => x['text'] (text), x['audio'] (dict with 'array' and 'sampling_rate')
# to save the audio output
# import soundfile as sf
# sf.write("examples/speechout.wav", x["audio"]["array"], x["audio"]["sampling_rate"])
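
In a live dialogue, the history grows one exchange at a time rather than being written out by hand. A minimal sketch of that bookkeeping follows. It assumes the reply text from the previous call (for example `x["text"]`) is fed back as an assistant turn in the same shape as the example above; `extend_conversation` is a hypothetical helper name.

```python
def extend_conversation(conversation, assistant_text, next_audio_path):
    """Return a new list with the assistant reply and the next user audio appended."""
    return conversation + [
        {"role": "assistant", "content": [{"type": "text", "text": assistant_text}]},
        {"role": "user", "content": [{"type": "audio", "audio_url": next_audio_path}]},
    ]


# Example with placeholder values; in practice assistant_text would come from x["text"].
history = [{"role": "system", "content": "You are a helpful female assistant named ไต้ฝุ่น."}]
history = extend_conversation(
    history,
    assistant_text="(previous reply text)",
    next_audio_path="examples/tmp-2284cd76e1c875525ff75327a2fc3610.wav",
)
```

The helper returns a new list instead of mutating its input, so an earlier state of the dialogue can be kept for retries.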

TTS functionality

y = model.synthesize_speech("Hello, my name is ไต้ฝุ่น I am a language model specialized in Thai")
# y => numpy array
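
If `soundfile` is not available, synthesized samples can be written with the Python standard library. The sketch below assumes `y` is a one-dimensional sequence of float samples in the range [-1.0, 1.0], as the comment above suggests, and that you know the output sampling rate. The rate is not stated here, so it is passed in explicitly. `write_wav` is a hypothetical helper name.

```python
import struct
import wave


def write_wav(path, samples, sampling_rate):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file."""
    frames = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))  # guard against overshoot
        frames += struct.pack("<h", int(round(clipped * 32767)))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono
        wav_file.setsampwidth(2)   # 2 bytes per sample = 16-bit
        wav_file.setframerate(sampling_rate)
        wav_file.writeframes(bytes(frames))


# Usage (the rate below is a placeholder, not a documented value):
# write_wav("examples/tts.wav", y, sampling_rate=16000)
```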

Evaluation Results

1) Audio and Speech Understanding

Model	ASR-en (WER↓)	ASR-th (WER↓)	En2Th (BLEU↑)	X2Th (BLEU↑)	Th2En (BLEU↑)
SALMONN-13B	5.79	98.07	0.07	0.10	14.97
DiVA-8B	30.28	65.21	9.82	5.31	7.97
Gemini-1.5-pro-001	5.98	13.56	20.69	13.52	22.54
Typhoon-Audio	8.72	14.17	17.52	10.67	24.14
Typhoon2-Audio	5.83	14.04	27.15	15.93	33.25

Model	Gender-th (Acc)	SpokenQA-th (F1)	SpeechInstruct-(en,th)
SALMONN-13B	93.26	2.95	2.47, 1.18
DiVA-8B	50.12	15.13	6.81, 2.68
Gemini-1.5-pro-001	81.32	62.10	3.24, 3.93
Typhoon-Audio	93.74	64.60	5.62, 6.11
Typhoon2-Audio	75.65	70.01	6.00, 6.79
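
For readers unfamiliar with the WER columns in the first table, the metric is the token-level edit distance between a reference transcript and the model output, divided by the number of reference tokens. The sketch below shows the standard computation. It takes pre-tokenized lists because Thai has no whitespace word boundaries; the tokenization and text normalization behind the reported numbers are not described here, so this is an illustration rather than the evaluation script.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference tokens."""
    ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one token")
    # previous[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis tokens.
    previous = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        current = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            cost = 0 if ref_token == hyp_token else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1] / len(ref)


ref = "list some names of us cities".split()
hyp = "list some name of cities".split()
print(f"{100 * word_error_rate(ref, hyp):.2f}%")  # one substitution + one deletion over 6 tokens
```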

2) Speech-to-Speech Evaluation

2.1) Content Generation

Model	SpeechIF(En)-Quality	SpeechIF(En)-Style	SpeechIF(Th)-Quality	SpeechIF(Th)-Style
Llama-Omni	5.15	5.79	1.71	2.14
GPT-4o-Audio	6.82	7.86	6.66	8.07
Typhoon2-Audio	4.92	5.39	7.19	8.04

2.2) Speech Quality

Model	SpeechIF(En)-CER	SpeechIF(En)-UTMOS	SpeechIF(Th)-CER	SpeechIF(Th)-UTMOS
Llama-Omni*	3.40	3.93	6.30	3.93
GPT-4o-Audio	3.20	3.65	8.05	3.46
Typhoon2-Audio	26.50	2.29	8.67	2.35

*Llama-Omni does not generate Thai text or speech. Its low CER and high UTMOS on the Thai set reflect the fact that its outputs are in English.

Intended Uses & Limitations

This model is experimental. It may not always follow instructions accurately and can produce hallucinations. It also lacks moderation mechanisms and may produce harmful or inappropriate responses. Developers should carefully assess potential risks for their specific applications.

Follow us & Support

https://twitter.com/opentyphoon
https://discord.gg/us5gAYmrxw

Acknowledgements

We would like to thank the SALMONN team and the Llama-Omni team for open-sourcing their code and data, and thanks to the Biomedical and Data Lab at Mahidol University for releasing the fine-tuned Whisper that allowed us to adopt its encoder. Thanks to many other open-source projects for their useful knowledge sharing, data, code, and model weights.

Typhoon Team

Potsawee Manakul, Warit Sirichotedumrong, Kunat Pipatanakul, Pittawat Taveekitworachai, Natapong Nitarach, Surapon Nonesung, Teetouch Jaknamon, Parinthapat Pengpun, Adisai Na-Thalang, Sittipong Sripaisarnmongkol, Krisanapong Jirayoot, Kasima Tharnpipitchai

Citation

If you find Typhoon2 useful for your work, please cite it using:

@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models}, 
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702}, 
}
