🚀 Scaling Analysis of Interleaved Speech-Text Language Models
This model accompanies a scaling analysis of interleaved speech-text language models, which asks whether interleaved speech language models scale more efficiently than textless speech language models.
🚀 Quick Start
We refer users to the official repository, [github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit), for full usage instructions.
✨ Features
- Conducts a scaling analysis of interleaved speech-text language models.
- Shows that interleaved speech language models scale more efficiently with compute.
- Finds scaling dynamics that differ markedly from those of textless speech language models.
- Achieves performance comparable to leading models on speech semantic metrics while using less compute and data.
📦 Installation
Installation steps are not included in this card; see the [SlamKit](https://github.com/slp-rl/slamkit) repository for environment setup.
💻 Usage Examples
The original card does not include code examples; complete usage examples are available in the [SlamKit](https://github.com/slp-rl/slamkit) codebase. An illustrative sketch follows.
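As a minimal sketch only (the checkpoint identifier below is a placeholder, and the exact interleaving of HuBERT speech tokens with text is defined by the SlamKit codebase), loading the model with 🤗transformers and sampling a continuation could look like this:

```python
# Minimal sketch: load the SpeechLM as a causal LM and sample a continuation.
# NOTE: "slprl/<checkpoint-name>" is a placeholder, not a real model id; the exact
# prompt format for speech tokens is defined by the SlamKit codebase.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "slprl/<checkpoint-name>"  # placeholder: substitute the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A plain text prompt; speech prompts would first be converted into discrete HuBERT
# unit tokens (part of the extended vocabulary) by the SlamKit tokenizer.
prompt = "The weather today is"
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

For actual speech continuation (audio in, audio out), the SlamKit codebase handles encoding audio into HuBERT units and vocoding generated units back into a waveform.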
📚 Documentation
Paper Introduction
The model was presented in the paper [Scaling Analysis of Interleaved Speech-Text Language Models](https://arxiv.org/abs/2504.02398).
Paper Abstract
Existing Speech Language Model (SLM) scaling analysis paints a bleak picture. It predicts that SLMs require much more compute and data compared to text, leading some to question the feasibility of training high-quality SLMs. However, modern SLMs are often initialised from pre-trained TextLMs using speech-text interleaving to allow knowledge transfer. This raises the question: do interleaved SLMs scale more efficiently than textless SLMs? In this paper we answer a resounding yes! We conduct scaling analysis of interleaved SLMs by training several dozen and analysing the scaling trends. We see that under this setup SLMs scale more efficiently with compute. Additionally, our results indicate that the scaling dynamics are significantly different than textless SLMs, suggesting one should allocate notably more of the compute budget for increasing model size over training tokens. We also study the role of synthetic data and TextLM model families in unlocking this potential. Results suggest that our scaled-up model achieves comparable performance with leading models on speech semantic metrics while using less compute and data than other approaches.
Model Card for Model ID
This is a Speech Language Model (SLM) trained for generating speech or text continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz) given speech-text prompts.
Model Details
Model Description
This Speech Language Model, introduced in "Scaling Analysis of Interleaved Speech-Text Language Models", focuses on scaling analysis of interleaved speech-text SLMs. It was fine-tuned from [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) by extending its vocabulary with 500 speech tokens extracted from the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).
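The vocabulary-extension step can be illustrated with a short sketch. This is not the authors' training code; it only shows the standard 🤗transformers pattern of adding unit tokens and resizing the embeddings, and the `<unit_i>` token names are placeholders:

```python
# Illustrative sketch of extending a TextLM's vocabulary with discrete speech units.
# Not the authors' training code; the "<unit_i>" token names are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-7B"
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# 500 tokens, one per HuBERT unit cluster extracted from mhubert-25hz layer 11.
speech_tokens = [f"<unit_{i}>" for i in range(500)]
num_added = tokenizer.add_tokens(speech_tokens)

# Grow the embedding (and output) matrices so the new tokens get trainable rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} speech tokens; new vocabulary size: {len(tokenizer)}")
```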
| Property | Details |
|----------|---------|
| Developed by | SLP-RL |
| Model Type | SpeechLM |
| License | MIT |
| Finetuned from model | [Qwen/Qwen2.5-7B](https://huggingface.co/Qwen/Qwen2.5-7B) |
Model Sources
- Repository: [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- Paper: https://arxiv.org/abs/2504.02398
- Demo: [https://pages.cs.huji.ac.il/adiyoss-lab/sims/](https://pages.cs.huji.ac.il/adiyoss-lab/sims/)
Uses
This base SpeechLM can be used to generate continuations for speech segments, or cross-modally, e.g. generating a text continuation to a speech prompt, or as a base for further tuning. See the SlamKit [codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/sims/) for some generation examples.
Out - of - Scope Use
This model was trained on diverse speech datasets; as such, its outputs should not be treated as factual in any way.
Training Details
We highly encourage users to read the full paper for complete training details.
Compute Infrastructure
Hardware
This model was trained using 8 Nvidia H100 GPUs.
Software
The model was trained using the [SlamKit](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers and extends it to support easy and efficient training of Speech Language Models.
🔧 Technical Details
The paper conducts a detailed scaling analysis of interleaved speech-text language models. By training several dozen models and analysing the scaling trends, it finds that interleaved speech language models scale more efficiently with compute. The scaling dynamics also differ significantly from those of textless speech language models, which provides guidance on how to allocate the compute budget between increasing model size and training tokens.
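As a rough illustration of the methodology only (not the paper's data or fitting code), a scaling trend of this kind is typically summarized by fitting a power law between compute and loss:

```python
# Illustrative power-law fit loss ≈ a * C^b over (compute, loss) pairs, as is common
# in scaling analyses. The numbers below are synthetic placeholders, not paper results.
import numpy as np

compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])  # training FLOPs (synthetic)
loss = np.array([3.2, 2.9, 2.6, 2.4, 2.2])          # validation loss (synthetic)

# Fit in log-log space: log(loss) = b * log(C) + log(a)
b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"fitted exponent b = {b:.3f}, coefficient a = {np.exp(log_a):.3g}")
```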
📄 License
The model is licensed under the MIT license.
📄 Citation
BibTeX:
@misc{maimon2025scaling,
title={Scaling Analysis of Interleaved Speech-Text Language Models},
author={Gallil Maimon and Michael Hassid and Amit Roth and Yossi Adi},
year={2025},
eprint={2504.02398},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2504.02398},
}