🚀 Model Card for mhubert-base-25hz
This is a version of HuBERT by Meta, introduced in TWIST, which has proven valuable as a speech tokeniser for training speech language models (SpeechLMs).
🚀 Quick Start
This model requires a recent version of transformers (`transformers>=4.48`). Make sure you have it installed. Then, you can use the model as follows:
```python
from transformers import HubertModel

model = HubertModel.from_pretrained('slprl/mhubert-base-25hz')
```
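A minimal sketch of running the loaded model to extract features at the 25Hz rate. The dummy input and the printed shape are illustrative; substitute a real 16 kHz waveform in practice:

```python
import torch
from transformers import HubertModel

model = HubertModel.from_pretrained('slprl/mhubert-base-25hz')
model.eval()

# One second of dummy 16 kHz audio -- replace with a real speech waveform
wav = torch.randn(1, 16000)
with torch.no_grad():
    features = model(wav).last_hidden_state

# At the 25Hz feature rate, one second of audio yields ~25 frames
print(features.shape)
```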
📚 Documentation
Model Details
Model Description
This Hubert model was introduced in TWIST. We encourage you to refer to it for comprehensive details.
It was trained on a diverse mixture of datasets: Multilingual LS, Vox Populi, Common Voice, Spotify, and Fisher. This HuBERT base model was trained for 3 iterations with the default 50Hz feature rate. For the 4th iteration, an additional convolutional layer with a stride of 2 was added to the CNN encoder, resulting in 25Hz features.
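The feature rate follows from the CNN encoder's total downsampling factor. A quick sanity check, assuming the standard HuBERT base strides of (5, 2, 2, 2, 2, 2, 2) plus the extra stride-2 layer described above:

```python
# Standard HuBERT base conv strides give a 320x downsample (16000 / 320 = 50 Hz);
# the appended stride-2 layer doubles this to 640x (16000 / 640 = 25 Hz).
strides = [5, 2, 2, 2, 2, 2, 2, 2]  # last entry is the added stride-2 layer
downsample = 1
for s in strides:
    downsample *= s

print(downsample, 16000 // downsample)  # 640 25
```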
We converted the original Fairseq release to Hugging Face 🤗 using the conversion script (after adding support for this architecture) and verified that the results are identical.
| Property | Details |
|---|---|
| Developed by | Hassid et al. |
| Shared by | SLP-RL |
| Model type | transformers.HubertModel |
| Languages | Multilingual |
| License | MIT, see textlesslib license for full details |
Model Sources
- Repository: https://github.com/facebookresearch/textlesslib/tree/main/examples/twist
- Paper: https://arxiv.org/abs/2305.13009
📄 License
The model is under the MIT license. See textlesslib license for full details.
📚 Citation
BibTeX:
```bibtex
@article{hassid2024textually,
  title={Textually pretrained speech language models},
  author={Hassid, Michael and Remez, Tal and Nguyen, Tu Anh and Gat, Itai and Conneau, Alexis and Kreuk, Felix and Copet, Jade and Defossez, Alexandre and Synnaeve, Gabriel and Dupoux, Emmanuel and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
```
👥 Model Card Authors
Gallil Maimon