🚀 UniSpeech-Large-plus FRENCH
Microsoft's UniSpeech is a large model pretrained on 16kHz sampled speech audio and phonetic labels, and fine - tuned on 1h of French phonemes. It's designed for automatic speech recognition tasks.
🚀 Quick Start
This model is based on Microsoft's UniSpeech. When using the model, ensure that your speech input is sampled at 16kHz and your text is converted into a sequence of phonemes.
✨ Features
- The model is pretrained on 16kHz sampled speech audio and phonetic labels and fine - tuned on French phonemes.
- It can capture information more correlated with phonetic structures and improve generalization across languages and domains.
📚 Documentation
Paper
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data
Authors
Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang
Abstract
In this paper, we propose a unified pre - training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically - aware contrastive self - supervised learning are conducted in a multi - task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross - lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self - supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain - shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.
Original Model
The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech.
💻 Usage Examples
Basic Usage
import torch
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor
import torchaudio.functional as F
model_id = "microsoft/unispeech-1350-en-353-fr-ft-1h"
sample = next(iter(load_dataset("common_voice", "fr", split="test", streaming=True)))
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), 48_000, 16_000).numpy()
model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)
input_values = processor(resampled_audio, return_tensors="pt").input_values
with torch.no_grad():
logits = model(input_values).logits
prediction_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(prediction_ids)
📄 License
The official license can be found here
📦 Additional Information
Contribution
The model was contributed by cywang and patrickvonplaten.
Official Results
See UniSpeeech - L^{+} - fr:

Property |
Details |
Model Type |
Speech model fine - tuned on phoneme classification |
Training Data |
Common Voice |