SpeechT5 ASR: Open-Source Automatic Speech Recognition Model

Developed by Microsoft

A SpeechT5 automatic speech recognition model fine-tuned on the LibriSpeech dataset, supporting speech-to-text conversion.

Speech Recognition

Transformers

Open Source License: MIT

Tags: #Speech-to-Text #Cross-modal Pretraining #High-precision ASR

Release Time: 2/2/2023

Model Overview

SpeechT5 is a unified encoder-decoder pre-training framework designed for spoken language processing tasks, supporting various applications such as speech recognition.

Model Features

Unified Modal Framework

Processes speech and text through a shared encoder-decoder network to achieve cross-modal representation learning.

Cross-modal Vector Quantization

Uses random mixing of speech/text states with latent units to align text and speech information in a unified semantic space.

Multi-task Support

The SpeechT5 framework covers speech recognition as well as speech synthesis, speech translation, voice conversion, and other spoken language processing tasks. This checkpoint is the one fine-tuned for speech recognition.

Model Capabilities

Speech Recognition

Speech-to-Text

Use Cases

Speech Processing

Automatic Speech Recognition

Converts speech content into text, suitable for meeting transcripts, voice assistants, and other scenarios.

Fine-tuned on the LibriSpeech dataset of read English speech.
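
ASR quality on LibriSpeech is usually reported as word error rate (WER): the word-level edit distance between the reference and the model's transcription, divided by the number of reference words. The following is a minimal sketch of that calculation in plain Python. The function name `word_error_rate` is hypothetical, and the sketch assumes whitespace tokenization with no text normalization, which real evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One missing word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.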

## 🚀 SpeechT5 (ASR task)

*A fine-tuned SpeechT5 model for automatic speech recognition (speech-to-text) on LibriSpeech.*

## 🚀 Quick Start
Use the code below to convert a mono 16 kHz speech waveform to text.
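
If a recording is stereo or uses a different sampling rate, it has to be converted to mono 16 kHz first. The sketch below illustrates the idea using only NumPy. It rests on several assumptions: the helper name `to_mono_16k` is hypothetical, multi-channel input is assumed to be shaped `(num_samples, num_channels)`, and linear interpolation stands in for a properly anti-aliased resampler, which a dedicated audio library would normally provide.

```python
import numpy as np


def to_mono_16k(waveform, orig_sr: int, target_sr: int = 16_000) -> np.ndarray:
    """Downmix to mono and resample by linear interpolation (illustrative only)."""
    samples = np.asarray(waveform, dtype=np.float32)
    if samples.ndim == 2:
        # Assumed layout: (num_samples, num_channels). Average the channels.
        samples = samples.mean(axis=1)
    if orig_sr == target_sr:
        return samples

    # Place the new sample points on the original time axis and interpolate.
    num_out = int(round(samples.shape[0] / orig_sr * target_sr))
    old_times = np.arange(samples.shape[0]) / orig_sr
    new_times = np.arange(num_out) / target_sr
    return np.interp(new_times, old_times, samples).astype(np.float32)


# One second of silent stereo audio at 44.1 kHz becomes 16,000 mono samples.
stereo = np.zeros((44_100, 2), dtype=np.float32)
print(to_mono_16k(stereo, orig_sr=44_100).shape)
```

The result can be passed as the `audio` argument of the processor together with `sampling_rate=16000`.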

### Basic Usage
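```python
from transformers import SpeechT5Processor, SpeechT5ForSpeechToText
from datasets import load_dataset

# Load a small LibriSpeech demo split and take one utterance as the example input.
dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
dataset = dataset.sort("id")
sampling_rate = dataset.features["audio"].sampling_rate
example_speech = dataset[0]["audio"]["array"]

# Load the processor (feature extractor and tokenizer) and the fine-tuned ASR model.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_asr")
model = SpeechT5ForSpeechToText.from_pretrained("microsoft/speecht5_asr")

# Convert the raw waveform into model inputs as PyTorch tensors.
inputs = processor(audio=example_speech, sampling_rate=sampling_rate, return_tensors="pt")

# Generate the token IDs of the transcription, capped at 100 tokens.
predicted_ids = model.generate(**inputs, max_length=100)

# Decode the token IDs back to text, dropping special tokens.
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
print(transcription[0])
```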

## ✨ Features

This model was introduced in "SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing" by Junyi Ao, Rui Wang, Long Zhou, Chengyi Wang, Shuo Ren, Yu Wu, Shujie Liu, Tom Ko, Qing Li, Yu Zhang, Zhihua Wei, Yao Qian, Jinyu Li, Furu Wei.
The original SpeechT5 code and weights were released by the paper's authors. The license used is MIT.
Disclaimer: The team releasing SpeechT5 did not write a model card for this model, so this model card has been written by the Hugging Face team.

## 📚 Documentation

### Model Description

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder.
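
The routing described above can be sketched as a small lookup: the input modality picks the encoder pre-net, and the output modality picks the decoder pre-net and post-net, while the encoder-decoder in the middle is shared. This is only an illustration of the data flow. The stage labels and the function `describe_pipeline` are hypothetical names, not identifiers from any library, and the flat list simplifies the fact that the decoder attends to the encoder output rather than simply following it.

```python
# Hypothetical labels for the six modal-specific nets (two per group).
ENCODER_PRENETS = {"speech": "speech_encoder_prenet", "text": "text_encoder_prenet"}
DECODER_PRENETS = {"speech": "speech_decoder_prenet", "text": "text_decoder_prenet"}
DECODER_POSTNETS = {"speech": "speech_decoder_postnet", "text": "text_decoder_postnet"}


def describe_pipeline(input_modality: str, output_modality: str) -> list[str]:
    """List the stages a task passes through, in simplified order."""
    return [
        ENCODER_PRENETS[input_modality],    # preprocess the input
        "shared_encoder",                   # same network for every task
        DECODER_PRENETS[output_modality],   # preprocess previous decoder outputs
        "shared_decoder",                   # same network for every task
        DECODER_POSTNETS[output_modality],  # produce speech or text output
    ]


# Speech recognition: speech in, text out.
print(describe_pipeline("speech", "text"))
# Speech synthesis: text in, speech out.
print(describe_pipeline("text", "speech"))
```

Under this view, speech recognition and speech synthesis differ only in which pre-nets and post-nets are attached.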

Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder.
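
To make the mixing step more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: a nearest-neighbour codebook lookup stands in for the learned quantizer, and the function name `mix_with_latent_units`, the array shapes, and the `mix_prob` value are all assumptions chosen for illustration.

```python
import numpy as np


def mix_with_latent_units(states, codebook, mix_prob, rng):
    """Randomly replace encoder states with their nearest codebook vectors.

    states:   (T, D) continuous encoder outputs for T time steps
    codebook: (K, D) latent units shared by speech and text
    mix_prob: probability of replacing each time step
    """
    # Squared Euclidean distance from every state to every latent unit: (T, K).
    distances = ((states[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    quantized = codebook[distances.argmin(axis=1)]

    # Pick the time steps to replace, then mix the two sequences.
    replaced = rng.random(states.shape[0]) < mix_prob
    mixed = np.where(replaced[:, None], quantized, states)
    return mixed, replaced


rng = np.random.default_rng(0)
states = rng.normal(size=(6, 4))
codebook = rng.normal(size=(3, 4))
mixed, replaced = mix_with_latent_units(states, codebook, mix_prob=0.5, rng=rng)
print(mixed.shape, int(replaced.sum()), "time steps replaced")
```

Because speech and text states are both pulled toward the same codebook, the decoder sees inputs from either modality expressed in a shared vocabulary of latent units.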

Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

### Intended Uses & Limitations

You can use this model for automatic speech recognition. See the model hub to look for fine-tuned versions on a task that interests you.

Currently, both the feature extractor and model support PyTorch.

## 📄 License

The license used is MIT.

## 🔧 Technical Details

### Citation

BibTeX:

@inproceedings{ao-etal-2022-speecht5,
    title = {{S}peech{T}5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing},
    author = {Ao, Junyi and Wang, Rui and Zhou, Long and Wang, Chengyi and Ren, Shuo and Wu, Yu and Liu, Shujie and Ko, Tom and Li, Qing and Zhang, Yu and Wei, Zhihua and Qian, Yao and Li, Jinyu and Wei, Furu},
    booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
    month = {May},
    year = {2022},
    pages={5723--5738},
}

