🚀 SLAM Model Card
This is a Speech Language Model (SLM) trained to generate speech continuations over discrete Hubert tokens, with an emphasis on compute-efficient training.
📋 Model Details
Model Description
This Speech Language Model, presented in "Slamming: Training a Speech Language Model on One GPU in a Day", emphasizes efficient training. It was fine-tuned from Qwen/Qwen2.5-0.5B using a vocabulary of 500 speech tokens extracted from the 11th layer of mhubert-25hz. For a more powerful version trained with slightly more compute (2xA100 for 2 days), refer to slam_scaled.
The model was trained by next-token prediction on a subset of LibriSpeech, Libri-Light, and the synthetic dataset sTinyStories, and then trained with DPO on SpokenSwag.
Model Sources
🛠️ Uses
This is a base SpeechLM that can generate continuations for speech segments or serve as a base for further tuning. For more usage details, see the SlamKit [codebase](https://github.com/slp-rl/slamkit), and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
⚠️ Important Note
This model was trained on curated speech datasets mainly containing audiobooks and stories. Thus, the outputs should not be considered factual in any way.
🚀 Quick Start
Refer to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit). A minimal loading-and-generation sketch follows.
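The sketch below shows the general idea of loading the checkpoint as a causal LM over speech units and sampling a continuation. The `slprl/slam` repo ID, the plain-transformers loading path, and the unit-token formatting are assumptions; the SlamKit codebase is the supported way to run tokenization and generation end to end.

```python
# Illustrative sketch only - see the SlamKit repo for the supported pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/slam"  # assumed Hub ID; replace with the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# A prompt of discrete speech units encoded as text tokens (placeholder formatting).
prompt_units = "<unit_12><unit_457><unit_3>"  # hypothetical unit tokens
inputs = tokenizer(prompt_units, return_tensors="pt")

with torch.no_grad():
    continuation = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
print(tokenizer.decode(continuation[0]))
```

The generated unit sequence can then be vocoded back to audio with the tooling in the official repository.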
🔧 Technical Details
Training Data
This model was pre-trained on a subset of LibriSpeech train, [Libri-Light](https://ai.meta.com/tools/libri-light/), and the synthetic dataset sTinyStories. It was also trained with DPO on the synthetic dataset SpokenSwag.
Training Procedure
The model was trained by next-token prediction on several datasets and then trained with DPO on SpokenSwag. For the full training recipes, please refer to the paper or [code](https://github.com/slp-rl/slamkit). An illustrative sketch of the preference-tuning stage is given below.
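The following is an illustrative sketch of the second (preference-tuning) stage using trl's `DPOTrainer`, not the authors' exact configuration: the dataset ID, hyperparameters, and column layout are assumptions, and trl argument names vary between versions.

```python
# Hedged sketch of the DPO stage; the real recipe lives in the SlamKit codebase.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "slprl/slam"  # assumed checkpoint after next-token pre-training
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs over speech-unit sequences: "prompt", "chosen", "rejected" columns.
pairs = load_dataset("slprl/SpokenSwag", split="train")  # placeholder dataset ID

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="slam-dpo", beta=0.1, per_device_train_batch_size=4),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```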
Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz) and quantized using the official kmeans released with the model in textlesslib. Units are de-duplicated. Explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
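As a rough sketch of that preprocessing idea: take layer-11 Hubert features, assign each frame to its nearest k-means centroid (500 clusters), and collapse consecutive duplicates. Loading the checkpoint directly with transformers' `HubertModel` and the `kmeans_centroids.npy` file name are assumptions here; the official textlesslib/SlamKit tooling is the supported path.

```python
# Minimal sketch of unit extraction, assuming a transformers-compatible Hubert checkpoint.
import itertools
import numpy as np
import torch
import torchaudio
from transformers import HubertModel

hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz").eval()
centroids = torch.from_numpy(np.load("kmeans_centroids.npy"))  # (500, hidden_dim), assumed file

wav, sr = torchaudio.load("example.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0, keepdim=True)

with torch.no_grad():
    feats = hubert(wav, output_hidden_states=True).hidden_states[11]  # layer-11 features

# Nearest-centroid assignment, then de-duplication of consecutive units.
units = torch.cdist(feats[0], centroids).argmin(dim=-1).tolist()
units = [u for u, _ in itertools.groupby(units)]
print(units[:20])
```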
📊 Evaluation
Full results are provided in the paper; a selection appears below. You can also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.
| Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|---|---|---|---|---|---|---|---|
| [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | 160xV100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
| [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
| Scaled Optimal | ? | 823M | 61.3 | 56.7 | 78.0 | - | - |
| Predicted Optimal | 1xA5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
| TWIST-350M (Original recipe) | 1xA5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| Slam (-DPO) (ours) | 1xA5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
| Slam (ours) | 1xA5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |
Compute Infrastructure
This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
Hardware
The model was trained using only a single Nvidia A5000 GPU, 16 CPU cores, and 24 GB of RAM for 24 hours.
Software
The model was trained using the [SlamKit](https://github.com/slp - rl/slamkit) codebase, which builds upon 🤗transformers and extends it to support easy and efficient training of Speech Language Models.
📄 License
This model is released under the MIT license.
📖 Citation
BibTeX:
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}