🚀 SLAM Model Card
This is a Speech Language Model (SLM) trained to generate speech continuations over discrete Hubert tokens, with an emphasis on compute-efficient training.
📋 Model Details
Model Description
This Speech Language Model, presented in "Slamming: Training a Speech Language Model on One GPU in a Day", emphasizes efficient training. It was fine-tuned from Qwen/Qwen2.5-0.5B using a vocabulary of 500 speech tokens extracted from the 11th layer of mhubert-25hz. For a more powerful version trained with slightly more compute (2xA100 for 2 days), refer to slam_scaled.
The model was trained by next-token prediction on a subset of LibriSpeech, Libri-Light, and the synthetic dataset sTinyStories, and then trained with DPO on SpokenSwag.
Model Sources
🛠️ Uses
This is a base SpeechLM that can generate continuations for speech segments or serve as a base for further tuning. For more usage details, see the SlamKit [codebase](https://github.com/slp-rl/slamkit), and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.
⚠️ Important Note
This model was trained on curated speech datasets mainly containing audiobooks and stories. Thus, the outputs should not be considered factual in any way.
🚀 Quick Start
Refer to the official repository for full usage explanations - [github](https://github.com/slp-rl/slamkit). A minimal loading-and-generation sketch follows.
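The sketch below shows the general idea of loading the checkpoint as a causal LM over speech units and sampling a continuation. The `slprl/slam` repo ID, the plain-transformers loading path, and the unit-token formatting are assumptions; the SlamKit codebase is the supported way to run tokenization and generation end to end.

```python
# Illustrative sketch only - see the SlamKit repo for the supported pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/slam"  # assumed Hub ID; replace with the actual checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

# A prompt of discrete speech units encoded as text tokens (placeholder formatting).
prompt_units = "<unit_12><unit_457><unit_3>"  # hypothetical unit tokens
inputs = tokenizer(prompt_units, return_tensors="pt")

with torch.no_grad():
    continuation = model.generate(**inputs, max_new_tokens=256, do_sample=True, top_p=0.95)
print(tokenizer.decode(continuation[0]))
```

The generated unit sequence can then be vocoded back to audio with the tooling in the official repository.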
🔧 Technical Details
Training Data
This model was pre-trained on a subset of LibriSpeech train, [Libri-Light](https://ai.meta.com/tools/libri-light/), and the synthetic dataset sTinyStories. It was also trained with DPO on the synthetic dataset SpokenSwag.
Training Procedure
The model was trained by next-token prediction on several datasets and then trained with DPO on SpokenSwag. For the full training recipes, please refer to the paper or [code](https://github.com/slp-rl/slamkit). An illustrative sketch of the preference-tuning stage is given below.
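The following is an illustrative sketch of the second (preference-tuning) stage using trl's `DPOTrainer`, not the authors' exact configuration: the dataset ID, hyperparameters, and column layout are assumptions, and trl argument names vary between versions.

```python
# Hedged sketch of the DPO stage; the real recipe lives in the SlamKit codebase.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "slprl/slam"  # assumed checkpoint after next-token pre-training
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs over speech-unit sequences: "prompt", "chosen", "rejected" columns.
pairs = load_dataset("slprl/SpokenSwag", split="train")  # placeholder dataset ID

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="slam-dpo", beta=0.1, per_device_train_batch_size=4),
    train_dataset=pairs,
    processing_class=tokenizer,
)
trainer.train()
```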
Preprocessing
Speech tokens are extracted from the audio using [Hubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz) and quantized using the official kmeans released with the model in textlesslib. Units are de-duplicated. Explore the official repository for full details - [github](https://github.com/slp-rl/slamkit).
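As a rough sketch of that preprocessing idea: take layer-11 Hubert features, assign each frame to its nearest k-means centroid (500 clusters), and collapse consecutive duplicates. Loading the checkpoint directly with transformers' `HubertModel` and the `kmeans_centroids.npy` file name are assumptions here; the official textlesslib/SlamKit tooling is the supported path.

```python
# Minimal sketch of unit extraction, assuming a transformers-compatible Hubert checkpoint.
import itertools
import numpy as np
import torch
import torchaudio
from transformers import HubertModel

hubert = HubertModel.from_pretrained("slprl/mhubert-base-25hz").eval()
centroids = torch.from_numpy(np.load("kmeans_centroids.npy"))  # (500, hidden_dim), assumed file

wav, sr = torchaudio.load("example.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0, keepdim=True)

with torch.no_grad():
    feats = hubert(wav, output_hidden_states=True).hidden_states[11]  # layer-11 features

# Nearest-centroid assignment, then de-duplication of consecutive units.
units = torch.cdist(feats[0], centroids).argmin(dim=-1).tolist()
units = [u for u, _ in itertools.groupby(units)]
print(units[:20])
```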
📊 Evaluation
Full results are provided in the paper; a selection appears below. You can also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.
| Model | Compute (GPU days) | Parameters | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|---|---|---|---|---|---|---|---|
| [TWIST-1.3B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | 160xV100 | 1B | 57.00 | 52.4 | 70.6 | 131.8 | 3.20 |
| [TWIST-7B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 7B | 59.00 | 55.3 | 74.1 | 93.7 | 3.06 |
| [TWIST-13B](https://pages.cs.huji.ac.il/adiyoss-lab/twist/) | ? | 13B | 59.20 | 55.4 | 76.4 | - | - |
| Scaled Optimal | ? | 823M | 61.3 | 56.7 | 78.0 | - | - |
| Predicted Optimal | 1xA5000 | 78M | 56.85 | 54.09 | 70.49 | - | - |
| TWIST-350M (Original recipe) | 1xA5000 | 305M | 51.52 ± .19 | 53.65 ± .57 | 68.80 ± .47 | 259.2 ± 6.7 | 3.26 ± .46 |
| Slam (-DPO) (ours) | 1xA5000 | 358M | 56.45 ± .17 | 55.59 ± .30 | 78.01 ± .27 | 88.3 ± 1.0 | 3.47 ± .17 |
| Slam (ours) | 1xA5000 | 358M | 58.86 ± .20 | 58.04 ± .51 | 82.04 ± .21 | 62.8 ± 4.1 | 3.88 ± .11 |
Compute Infrastructure
This model was trained as part of "Slamming: Training a Speech Language Model on One GPU in a Day", focusing on efficient training.
Hardware
The model was trained using only a single Nvidia A5000 GPU, 16 CPU cores, and 24 GB of RAM for 24 hours.
Software
The model was trained using the [SlamKit](https://github.com/slp - rl/slamkit) codebase, which builds upon 🤗transformers and extends it to support easy and efficient training of Speech Language Models.
📄 License
This model is released under the MIT license.
📖 Citation
BibTeX:
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}