Slamming: Training a Speech Language Model on One GPU in a Day
This project presents a method to train high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours, aiming to make SLM training and research more accessible.
Quick Start
We refer users to the official GitHub repository for full usage instructions.
Features
- Efficient Training: The model can be trained on a single academic GPU in 24 hours.
- Good Scalability: The training recipe scales well with more compute, achieving results comparable to leading SLMs at a fraction of the compute cost.
Installation
Specific installation steps are not listed here; please refer to the official repository for setup instructions.
Usage Examples
Full, up-to-date usage examples are provided in the official repository; a minimal sketch is given below.
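As a rough, hedged sketch only (not code from the official repository): since SlamKit builds on Hugging Face Transformers, loading the checkpoint as a causal LM and generating a continuation over speech-unit IDs could look roughly like the following. The model identifier and the prompt unit IDs are placeholders; real prompts come from the HuBERT tokenizer described under Preprocessing, and turning units back into audio requires the SlamKit vocoder tooling.

```python
# Hedged sketch, not official usage: load the SLM as a standard causal LM
# (SlamKit builds on Hugging Face Transformers) and sample a continuation.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<slam-model-id>")  # placeholder model ID
model.eval()

# Hypothetical prompt: de-duplicated speech-unit IDs in [0, 500) from the HuBERT tokenizer.
prompt_units = torch.tensor([[17, 342, 8, 256, 91, 404]])

with torch.no_grad():
    out = model.generate(prompt_units, max_new_tokens=64, do_sample=True, top_p=0.95)

print(out[0].tolist())  # continuation as unit IDs; vocode back to audio with SlamKit tooling
```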
Documentation
Model Details
Model Description
This Speech Language Model, introduced in "Slamming: Training a Speech Language Model on One GPU in a Day", focuses on efficient training.
It was fine-tuned from Qwen/Qwen2.5-0.5B over a vocabulary of 500 speech tokens extracted from
the 11-th layer of mhubert-25hz.
The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset
sTinyStories. It was subsequently fine-tuned with DPO on
SpokenSwag.
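For intuition only, the sketch below shows one simple way to repurpose a text LM for a 500-unit speech vocabulary: load Qwen2.5-0.5B and resize its embedding table to the speech-token vocabulary before training on units. This is our own illustrative assumption, not necessarily the paper's exact initialisation procedure.

```python
# Hedged sketch of initialising a speech LM from a text LM; the paper's exact
# initialisation may differ from this simple embedding resize.
from transformers import AutoModelForCausalLM

SPEECH_VOCAB_SIZE = 500  # number of HuBERT k-means units

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# Shrink the vocabulary to 500 entries; input embeddings and LM head are resized together.
model.resize_token_embeddings(SPEECH_VOCAB_SIZE)
print(model.config.vocab_size)  # 500
```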
Uses
This base SpeechLM can be used to generate continuations for speech segments, or as a starting point for further tuning. See the SlamKit
codebase for more details on usage, and check out the demo page for some generation examples.
Important Note
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.
Training Details
We highly encourage users to read the full paper for complete training details; a brief overview is provided below.
Training Data
This model was trained on a subset of LibriSpeech train,
Libri-Light and the synthetic dataset
sTinyStories for the pre-training phase. It was also trained with DPO on the synthetic
dataset SpokenSwag.
Training Procedure
This model was first trained with next-token prediction over several datasets, and then trained with DPO over SpokenSwag.
Please refer to the paper or code for the full training recipes.
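To make the second stage concrete, here is a generic implementation of the standard DPO objective (Rafailov et al.), given purely as an illustration rather than the paper's code; the inputs are summed sequence log-probabilities of the preferred and dispreferred continuations under the policy being trained and under a frozen reference model (the pre-trained SLM).

```python
# Generic DPO loss sketch (standard formulation), not the SlamKit implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Log-ratios of policy vs. reference for the preferred and dispreferred continuations.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * margin)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of 4 preference pairs (e.g. drawn from SpokenSwag).
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```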
Preprocessing
Speech tokens are extracted from the audio using Hubert-25hz and quantised using the
official k-means released with the model in textlesslib. Units are de-duplicated.
We encourage you to explore the official GitHub repository for full details.
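As a plain-NumPy illustration of the two steps described above (our own sketch, not the textlesslib/SlamKit implementation): frame features from the HuBERT encoder are assigned to their nearest k-means centroid out of the 500-unit codebook, and consecutive repeated units are then collapsed.

```python
# Hedged sketch of unit extraction: nearest-centroid quantisation + de-duplication.
import numpy as np

def quantise(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """features: (T, D) frame features; centroids: (500, D) k-means codebook."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # one unit ID per frame

def deduplicate(units: np.ndarray) -> np.ndarray:
    """Collapse consecutive repeats, e.g. [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
    keep = np.ones(len(units), dtype=bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]

# Toy example with random data standing in for real HuBERT features and centroids.
rng = np.random.default_rng(0)
units = quantise(rng.normal(size=(50, 768)), rng.normal(size=(500, 768)))
print(deduplicate(units))
```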
Evaluation
The paper provides full results; we report a selection here and also refer readers to the demo page to listen to some samples.
| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|---|---|---|---|---|---|---|---|---|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | – | – |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | – | 75.4 | – | – |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | – | – | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | – | – |
| Scaled Optimal | – | 823M | 82B | 61.3 | 56.7 | 78.0 | – | – |
| Moshi | ?×H100 | 7B | ? | 58.9 | 58.7 | 81.8 | – | – |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | – | – |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | – | 9B | ~1T | – | 62.4 | 82.9 | – | – |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | – | – |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | – | – |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | – | – |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | 62.3 | 61.1 | 86.8 | – | – |
| **Ours (Slam)** | | | | | | | | |
| Slam (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| Slam | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| Slam (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |
Compute Infrastructure
This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
Hardware
This model was trained using only 2 Nvidia A100 GPUs for 48 hours.
Software
The model was trained using the SlamKit codebase, which builds upon Hugging Face Transformers and extends it to support
easy and efficient training of Speech Language Models.
Technical Details
The model was presented in the paper Slamming: Training a Speech Language Model on One GPU in a Day. Through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and careful tuning of all other components, it achieves efficient training of high-quality SLMs.
License
This model is licensed under the MIT license.
Citation
BibTeX:
@misc{maimon2025slamming,
title={Slamming: Training a Speech Language Model on One GPU in a Day},
author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
year={2025},
eprint={2502.15814},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.15814},
}