Slamming: Training a Speech Language Model on One GPU in a Day
This project presents a method to train high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours, aiming to make SLM training and research more accessible.
Quick Start
We refer users to the official GitHub repository for full usage instructions.
Features
- Efficient Training: The model can be trained on a single academic GPU in 24 hours.
- Good Scalability: The training recipe scales well with more compute, achieving results comparable to leading SLMs at a fraction of the compute cost.
Installation
Specific installation steps are not listed here; please refer to the official repository for setup instructions.
Usage Examples
Full, up-to-date usage examples are provided in the official repository; a minimal sketch is given below.
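As a rough, hedged sketch only (not code from the official repository): since SlamKit builds on Hugging Face Transformers, loading the checkpoint as a causal LM and generating a continuation over speech-unit IDs could look roughly like the following. The model identifier and the prompt unit IDs are placeholders; real prompts come from the HuBERT tokenizer described under Preprocessing, and turning units back into audio requires the SlamKit vocoder tooling.

```python
# Hedged sketch, not official usage: load the SLM as a standard causal LM
# (SlamKit builds on Hugging Face Transformers) and sample a continuation.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("<slam-model-id>")  # placeholder model ID
model.eval()

# Hypothetical prompt: de-duplicated speech-unit IDs in [0, 500) from the HuBERT tokenizer.
prompt_units = torch.tensor([[17, 342, 8, 256, 91, 404]])

with torch.no_grad():
    out = model.generate(prompt_units, max_new_tokens=64, do_sample=True, top_p=0.95)

print(out[0].tolist())  # continuation as unit IDs; vocode back to audio with SlamKit tooling
```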
Documentation
Model Details
Model Description
This Speech Language Model, introduced in "Slamming: Training a Speech Language Model on One GPU in a Day", focuses on efficient training.
It was fine-tuned from Qwen/Qwen2.5-0.5B over a vocabulary of 500 speech tokens extracted from
the 11-th layer of mhubert-25hz.
The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset
sTinyStories. It was subsequently fine-tuned with DPO on
SpokenSwag.
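For intuition only, the sketch below shows one simple way to repurpose a text LM for a 500-unit speech vocabulary: load Qwen2.5-0.5B and resize its embedding table to the speech-token vocabulary before training on units. This is our own illustrative assumption, not necessarily the paper's exact initialisation procedure.

```python
# Hedged sketch of initialising a speech LM from a text LM; the paper's exact
# initialisation may differ from this simple embedding resize.
from transformers import AutoModelForCausalLM

SPEECH_VOCAB_SIZE = 500  # number of HuBERT k-means units

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
# Shrink the vocabulary to 500 entries; input embeddings and LM head are resized together.
model.resize_token_embeddings(SPEECH_VOCAB_SIZE)
print(model.config.vocab_size)  # 500
```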
Uses
This base SpeechLM can be used to generate continuations for speech segments, or as a starting point for further tuning. See the SlamKit
codebase for more details on usage, and check out the demo page for some generation examples.
Important Note
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.
Training Details
We highly encourage users to read the full paper for complete training details; a brief overview is provided below.
Training Data
This model was trained on a subset of LibriSpeech train,
Libri-Light and the synthetic dataset
sTinyStories for the pre-training phase. It was also trained with DPO on the synthetic
dataset SpokenSwag.
Training Procedure
This model was first trained with next-token prediction over several datasets, and then trained with DPO over SpokenSwag.
Please refer to the paper or code for the full training recipes.
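To make the second stage concrete, here is a generic implementation of the standard DPO objective (Rafailov et al.), given purely as an illustration rather than the paper's code; the inputs are summed sequence log-probabilities of the preferred and dispreferred continuations under the policy being trained and under a frozen reference model (the pre-trained SLM).

```python
# Generic DPO loss sketch (standard formulation), not the SlamKit implementation.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # Log-ratios of policy vs. reference for the preferred and dispreferred continuations.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log(sigmoid(beta * margin)), averaged over the batch.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy batch of 4 preference pairs (e.g. drawn from SpokenSwag).
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```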
Preprocessing
Speech tokens are extracted from the audio using Hubert-25hz and quantised using the
official k-means released with the model in textlesslib. Units are de-duplicated.
We encourage you to explore the official GitHub repository for full details.
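As a plain-NumPy illustration of the two steps described above (our own sketch, not the textlesslib/SlamKit implementation): frame features from the HuBERT encoder are assigned to their nearest k-means centroid out of the 500-unit codebook, and consecutive repeated units are then collapsed.

```python
# Hedged sketch of unit extraction: nearest-centroid quantisation + de-duplication.
import numpy as np

def quantise(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """features: (T, D) frame features; centroids: (500, D) k-means codebook."""
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # one unit ID per frame

def deduplicate(units: np.ndarray) -> np.ndarray:
    """Collapse consecutive repeats, e.g. [5, 5, 5, 9, 9, 5] -> [5, 9, 5]."""
    keep = np.ones(len(units), dtype=bool)
    keep[1:] = units[1:] != units[:-1]
    return units[keep]

# Toy example with random data standing in for real HuBERT features and centroids.
rng = np.random.default_rng(0)
units = quantise(rng.normal(size=(50, 768)), rng.normal(size=(500, 768)))
print(deduplicate(units))
```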
Evaluation
The paper provides full results; we report a selection here and also refer readers to the demo page to listen to some samples.
| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|---|---|---|---|---|---|---|---|---|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | – | – |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | – | 75.4 | – | – |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | – | – | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | – | – |
| Scaled Optimal | – | 823M | 82B | 61.3 | 56.7 | 78.0 | – | – |
| Moshi | ?×H100 | 7B | ? | 58.9 | 58.7 | 81.8 | – | – |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | – | – |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | – | 9B | ~1T | – | 62.4 | 82.9 | – | – |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | – | – |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | – | – |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | – | – |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | 62.3 | 61.1 | 86.8 | – | – |
| **Ours (Slam)** | | | | | | | | |
| Slam (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| Slam | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| Slam (scaled) | 2×A100 | 358M | 16.7B + 9M | 61.11 | 61.30 | 84.18 | 46.6 | 3.75 |
Compute Infrastructure
This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
Hardware
This model was trained using only 2 Nvidia A100 GPUs for 48 hours.
Software
The model was trained using the SlamKit codebase, which builds upon Hugging Face Transformers and extends it to support
easy and efficient training of Speech Language Models.
Technical Details
The model was presented in the paper Slamming: Training a Speech Language Model on One GPU in a Day. Through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and careful tuning of all other components, it achieves efficient training of high-quality SLMs.
License
This model is licensed under the MIT license.
Citation
BibTeX:
@misc{maimon2025slamming,
title={Slamming: Training a Speech Language Model on One GPU in a Day},
author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
year={2025},
eprint={2502.15814},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2502.15814},
}