Llasa 3B
Llasa is a LLaMA-based text-to-speech (TTS) system that extends the language model with speech tokens, supporting both Chinese and English speech generation.
Model Overview
Llasa is a text-to-speech (TTS) system that extends the text-based LLaMA language model with the 65,536 speech tokens of the XCodec2 codebook; each speech token is written in text form, e.g. codebook entry 23456 becomes the token <|s_23456|>. The model can generate speech either from input text alone or from a given voice prompt.
Model Features
Train-Time and Inference-Time Compute Scaling
Supports scaling compute at both training and inference time to improve model performance.
Multilingual Support
Supports speech generation in both Chinese and English.
Voice Prompt Generation
Capable of generating speech using given voice prompts.
Efficient Training
Training the TTS model is similar to training an LLM, so existing LLM compression, acceleration, and fine-tuning methods carry over (see the sketch after this list).
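For illustration, here is a minimal LoRA fine-tuning sketch, assuming the Hugging Face peft library; this is not the official Llasa fine-tuning recipe (see the LLaSA Training Repository referenced below), only an example of how standard LLM tooling carries over:

```python
# Hypothetical example: parameter-efficient fine-tuning with LoRA adapters.
# Not the official Llasa finetune instruction; shown only because Llasa is
# a standard causal LM, so PEFT-style methods apply directly.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained('HKUSTAudio/Llasa-3B')

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only adapter weights are trainable
```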
Model Capabilities
Text-to-Speech
Voice Prompt Generation
Chinese-English Speech Synthesis
Use Cases
Speech Synthesis
Voice Assistants
Generating natural speech for virtual assistants.
Produces high-quality speech output.
Audiobooks
Converting text content into speech.
Generates natural and fluent speech.
Voice Prompt Applications
Voice Style Transfer
Generating speech with a similar style based on given voice prompts.
Maintains consistency in voice style.
🚀 Llasa - Text-to-Speech Model
Llasa is a text-to-speech system that extends the LLaMA language model, enabling speech synthesis from text or with a speech prompt.
🚀 Quick Start
- Explore our collection: See our collection for all our TTS model uploads.
- Learn fine-tuning: Learn to fine-tune TTS models by reading our guide.
- Discover Unsloth Dynamic 2.0: Unsloth Dynamic 2.0 achieves superior accuracy and outperforms other leading quants.
Free Notebooks and Performance
| Property | Details |
|---|---|
| Model Type | Llasa-3B, Whisper Large V3, Qwen3 (14B), Llama 3.2 Vision (11B) |
| Training Data | 250,000 hours of Chinese-English speech data |

| Model | Free Notebook Link | Performance | Memory Use |
|---|---|---|---|
| Llasa-3B | ▶️ Start on Colab | 1.5x faster | 58% less |
| Whisper Large V3 | ▶️ Start on Colab | 1.5x faster | 50% less |
| Qwen3 (14B) | ▶️ Start on Colab | 2x faster | 70% less |
| Llama 3.2 Vision (11B) | ▶️ Start on Colab | 1.8x faster | 50% less |
Updates
- 2025-05-10: Sometimes I find that top_p = 0.95 and temperature = 0.9 produce more stable results.
- 2025-02-13: Added Llasa fine-tuning instructions.
- 2025-02-07: Our paper LLaSA: Scaling Train-Time and Inference-Time Compute for LLaMA-based Speech Synthesis has been released!
Training and Testing
- Train from Scratch: If you want to train the model from scratch, use the LLaSA Training Repository.
- Scale Test-Time Compute: If you want to experiment with scaling test-time compute, use the LLaSA Testing Repository.
✨ Features
- Extended LLaMA: Llasa extends the text-based LLaMA (1B, 3B, and 8B) language models by incorporating speech tokens from the XCodec2 codebook.
- Diverse Speech Generation: Capable of generating speech either solely from input text or by utilizing a given speech prompt.
- Compatibility: The method is compatible with the LLaMA framework, allowing existing LLM techniques for compression, acceleration, and fine-tuning to be applied; see the sketch below.
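As a concrete illustration of this compatibility, here is a minimal sketch of generic 4-bit loading (our assumption for demonstration, not something the card prescribes; it requires the bitsandbytes and accelerate packages and a CUDA GPU):

```python
# Hypothetical example: load Llasa-3B with off-the-shelf 4-bit LLM quantization.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # quantize weights to 4 bits
    bnb_4bit_compute_dtype=torch.float16,  # run compute in fp16
)
model = AutoModelForCausalLM.from_pretrained(
    'HKUSTAudio/Llasa-3B',
    quantization_config=bnb_config,
    device_map='auto',                     # needs the accelerate package
)
```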
📦 Installation
Install XCodec2.
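For example (assuming the codec is published on PyPI under the name xcodec2; the usage examples below additionally require transformers, torch, and soundfile):

```bash
pip install xcodec2 transformers torch soundfile
```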
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

input_text = 'Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.'
# input_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'

def ids_to_speech_tokens(speech_ids):
    # Convert int 12345 to token <|s_12345|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token <|s_23456|> back to int 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>"}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,          # Adjusts the diversity of the generated content
        temperature=0.8,  # Controls randomness in the output
    )

    # Extract the speech tokens (drop the text prompt and the final end token)
    generated_ids = outputs[0][input_ids.shape[1]:-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
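As noted in the Updates above, `top_p = 0.95` and `temperature = 0.9` can sometimes produce more stable results than the `top_p = 1` / `temperature = 0.8` used in this example.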
Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import soundfile as sf

llasa_3b = 'HKUSTAudio/Llasa-3B'

tokenizer = AutoTokenizer.from_pretrained(llasa_3b)
model = AutoModelForCausalLM.from_pretrained(llasa_3b)
model.eval()
model.to('cuda')

from xcodec2.modeling_xcodec2 import XCodec2Model

model_path = "HKUSTAudio/xcodec2"
Codec_model = XCodec2Model.from_pretrained(model_path)
Codec_model.eval().cuda()

# Only 16 kHz speech is supported!
prompt_wav, sr = sf.read("太乙真人.wav")  # you can find the wav in Files
# prompt_wav, sr = sf.read("Anna.wav")  # English prompt
prompt_wav = torch.from_numpy(prompt_wav).float().unsqueeze(0)

prompt_text = "对,这就是我万人敬仰的太乙真人,虽然有点婴儿肥,但也掩不住我逼人的帅气。"
# prompt_text = "A chance to leave him alone, but... No. She just wanted to see him again. Anna, you don't know how it feels to lose a sister. Anna, I'm sorry, but your father asked me not to tell you anything."

target_text = '突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"'
# target_text = "Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me."

input_text = prompt_text + target_text

def ids_to_speech_tokens(speech_ids):
    # Convert int 12345 to token <|s_12345|>
    speech_tokens_str = []
    for speech_id in speech_ids:
        speech_tokens_str.append(f"<|s_{speech_id}|>")
    return speech_tokens_str

def extract_speech_ids(speech_tokens_str):
    # Convert token <|s_23456|> back to int 23456
    speech_ids = []
    for token_str in speech_tokens_str:
        if token_str.startswith('<|s_') and token_str.endswith('|>'):
            num_str = token_str[4:-2]
            num = int(num_str)
            speech_ids.append(num)
        else:
            print(f"Unexpected token: {token_str}")
    return speech_ids

# TTS start!
with torch.no_grad():
    # Encode the prompt wav into codec tokens
    vq_code_prompt = Codec_model.encode_code(input_waveform=prompt_wav)
    print("Prompt Vq Code Shape:", vq_code_prompt.shape)
    vq_code_prompt = vq_code_prompt[0, 0, :]

    # Convert int 12345 to token <|s_12345|>
    speech_ids_prefix = ids_to_speech_tokens(vq_code_prompt)

    formatted_text = f"<|TEXT_UNDERSTANDING_START|>{input_text}<|TEXT_UNDERSTANDING_END|>"

    # Tokenize the text and the speech prefix
    chat = [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
    ]

    input_ids = tokenizer.apply_chat_template(
        chat,
        tokenize=True,
        return_tensors='pt',
        continue_final_message=True
    )
    input_ids = input_ids.to('cuda')
    speech_end_id = tokenizer.convert_tokens_to_ids('<|SPEECH_GENERATION_END|>')

    # Generate the speech autoregressively
    outputs = model.generate(
        input_ids,
        max_length=2048,  # We trained our model with a max length of 2048
        eos_token_id=speech_end_id,
        do_sample=True,
        top_p=1,
        temperature=0.8,
    )

    # Extract the speech tokens, keeping the prompt's speech tokens as well
    generated_ids = outputs[0][input_ids.shape[1] - len(speech_ids_prefix):-1]
    speech_tokens = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

    # Convert token <|s_23456|> to int 23456
    speech_tokens = extract_speech_ids(speech_tokens)
    speech_tokens = torch.tensor(speech_tokens).cuda().unsqueeze(0).unsqueeze(0)

    # Decode the speech tokens to a speech waveform
    gen_wav = Codec_model.decode_code(speech_tokens)

    # If you only need the generated part:
    # gen_wav = gen_wav[:, :, prompt_wav.shape[1]:]

sf.write("gen.wav", gen_wav[0, 0, :].cpu().numpy(), 16000)
```
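Because `generated_ids` is sliced starting at `input_ids.shape[1] - len(speech_ids_prefix)`, the decoded waveform contains the re-synthesized prompt followed by the continuation in the same voice; uncomment the trimming line near the end if you only want the newly generated portion.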
📄 License
This model is licensed under the CC BY-NC 4.0 license, which prohibits free commercial use because of ethics and privacy concerns; detected violations will result in legal consequences.
⚠️ Important Note
This codebase must not be used for any illegal purpose in any country or region. Please refer to your local laws regarding the DMCA and other applicable regulations.