🚀 Speechless
Speechless is a compact, open-source text-to-semantics model with 1B parameters. It directly generates semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model. By bypassing the traditional TTS → ASR pipeline, it simplifies training, saves resources, and scales well, especially to low-resource languages. Trained on about 400 hours of English and 1000 hours of Vietnamese data, it is a core part of the Ichigo v0.5 family.
🚀 Quick Start
You can use the following example code to load the model.
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
>>> [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
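The generated string interleaves `<|sound_NNNN|>` and `<|duration_NN|>` tokens. A minimal sketch for extracting the numeric IDs from such output (the regex and function name are illustrative, not part of the model's library):

```python
import re


def parse_semantic_tokens(generated_text: str) -> list[tuple[str, int]]:
    """Extract (kind, id) pairs from <|sound_NNNN|> / <|duration_NN|> tokens."""
    return [
        (kind, int(num))
        for kind, num in re.findall(r"<\|(sound|duration)_(\d+)\|>", generated_text)
    ]


sample = "<|sound_1968|><|sound_0464|><|duration_02|>"
print(parse_semantic_tokens(sample))  # [('sound', 1968), ('sound', 464), ('duration', 2)]
```

The integer IDs could then be fed to a downstream decoder or vocoder that consumes this token vocabulary.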
✨ Features
- Direct Semantic Generation: Generates direct semantic representations of audio as discrete tokens, eliminating the need for a TTS model.
- Resource-Efficient: Simplifies training and saves resources by bypassing the traditional TTS → ASR pipeline.
- Scalable: Suitable for low-resource languages.
📦 Installation
No specific installation steps are provided in the original document; the Quick Start example assumes a recent `torch` and `transformers` installation.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
```
📚 Documentation
Model Summary
| Property | Details |
|---|---|
| Developed by | Homebrew Research |
| Model Architecture | Llama |
| Model Type | Text-to-Semantics |
| Language(s) | English and Vietnamese |
| License | Apache 2.0 |
Resources
Intended Use
- Intended Use Cases: This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.
- Out-of-scope: The use of this model in any manner that violates applicable laws or regulations is strictly prohibited.
Training Specs
| Parameter | Value |
|---|---|
| Epochs | 2 |
| Global Batch Size | 144 |
| Learning Rate | 3e-4 |
| Learning Scheduler | Cosine |
| Optimizer | AdamW |
| Warmup Ratio | 0.05 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 |
| Clip Grad Norm | 1.0 |
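For readers reproducing the setup, the table above maps roughly onto the config dict below. The key names follow common Hugging Face `TrainingArguments` conventions and are an assumption; the actual training script is not included in this document.

```python
# Hypothetical config mirroring the Training Specs table (key names assumed).
training_config = {
    "num_train_epochs": 2,
    "global_batch_size": 144,     # effective batch size across all devices
    "learning_rate": 3e-4,
    "lr_scheduler_type": "cosine",
    "optimizer": "adamw",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "max_seq_length": 512,
    "max_grad_norm": 1.0,         # gradient clipping threshold
}
print(training_config["learning_rate"])  # 0.0003
```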
Evaluation
Vietnamese
| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | viet_bud500 | 7500 | 3.99 |
English
| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | librispeech_asr | 2620 | 3.27 |
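The WER figures above are word error rates: the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. A minimal reference implementation (not the evaluation script the authors used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,      # deletion
                dp[i][j - 1] + 1,      # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("a b c d", "a x c d"))          # 0.25
```

Reported WER is usually scaled to a percentage, so a score of 3.27 corresponds to `wer(...) == 0.0327`.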
Citation Information
BibTeX:
```bibtex
@article{speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```
Acknowledgement
🔧 Technical Details
The model is trained on datasets including homebrewltd/Ichigo-tokenized-v0.1, comprising about 400 hours of English and 1000 hours of Vietnamese data. The training parameters are listed in the "Training Specs" section.
📄 License
This model is licensed under the Apache 2.0 license.