UniSpeech Open-Source Speech Model - Trained with Multiple Types of Data, Fine-Tuned and Optimized Specifically for French Processing

unispeech-1350-en-353-fr-ft-1h

Developed by Microsoft

UniSpeech is a unified speech representation learning model that combines labeled and unlabeled data for pre-training, specifically fine-tuned for French.

Speech Recognition

Transformers

French #French speech recognition #Phoneme-level modeling #Multi-task pre-training

Downloads 20

Release Time: 3/2/2022

Model Overview

This model is pre-trained on 16kHz sampled speech audio with phoneme labels and fine-tuned on 1 hour of French phoneme data, primarily designed for French automatic speech recognition tasks.

Model Features

Unified Learning Framework

Simultaneously performs supervised phoneme CTC learning and phoneme-aware contrastive self-supervised learning
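As a rough illustration of what "simultaneously" can mean here, the sketch below folds a supervised CTC loss value and a contrastive loss value into a single training objective through a weighted sum. The function name `multitask_loss`, the mixing weight `alpha`, and the example numbers are assumptions made for illustration; this card does not state how the two terms are weighted.

```python
def multitask_loss(ctc_loss: float, contrastive_loss: float, alpha: float = 0.5) -> float:
    """Weighted sum of a supervised and a self-supervised loss term.

    alpha is a hypothetical mixing weight in [0, 1]: alpha=1 keeps only the
    CTC term, alpha=0 keeps only the contrastive term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * ctc_loss + (1.0 - alpha) * contrastive_loss


# Illustrative numbers only: a labeled batch contributes both terms.
labeled_batch_loss = multitask_loss(ctc_loss=2.0, contrastive_loss=4.0, alpha=0.5)

# An unlabeled batch has no phoneme targets, so only the contrastive term applies.
unlabeled_batch_loss = multitask_loss(ctc_loss=0.0, contrastive_loss=4.0, alpha=0.0)
```

Keeping both terms in one objective is what lets labeled and unlabeled audio shape the same representation during pre-training.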

Cross-lingual Capability

Evaluated for cross-lingual representation learning on the public CommonVoice corpus, where it outperforms both self-supervised pre-training and supervised transfer learning

Domain Adaptability

Transfers to a domain-shift speech recognition task, with a 6% relative word error rate reduction against the previous approach

Model Capabilities

French speech recognition

Phoneme sequence prediction

Cross-lingual speech representation learning

Use Cases

Speech Recognition

French Speech to Phoneme

Convert French speech into phoneme sequences

Compared with self-supervised pre-training and supervised transfer learning, UniSpeech achieves relative phone error rate reductions of up to 13.4% and 17.8%, respectively (as reported in the paper, averaged over all test languages)
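To make the metric concrete, here is a minimal sketch of how a phone error rate (PER) and a relative reduction between two systems could be computed. The helper names, the toy phoneme sequences, and the error rates passed to `relative_reduction` are illustrative assumptions and are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phoneme symbols."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_rate, improved_rate):
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline_rate - improved_rate) / baseline_rate


# Toy example: one substitution (ɛ -> e) and one deleted final phoneme.
reference = "v ʁ ɛ t ʁ a v a".split()
hypothesis = "v ʁ e t ʁ a v".split()
per = phone_error_rate(reference, hypothesis)  # 2 errors over 8 reference phonemes

# Hypothetical error rates: going from 25% to 20% PER is a 20% relative reduction.
reduction = relative_reduction(0.25, 0.20)
```

A relative reduction is measured against the baseline's error rate, so a 13.4% relative reduction does not mean the absolute PER dropped by 13.4 points.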

Speech Research

Cross-lingual Speech Representation Research

Study speech representation transfer across different languages

🚀 UniSpeech-Large-plus FRENCH

Microsoft's UniSpeech is a large model pretrained on 16kHz sampled speech audio and phonetic labels, and fine-tuned on 1h of French phonemes. It is designed for automatic speech recognition tasks.

🚀 Quick Start

This model is based on Microsoft's UniSpeech. When using the model, ensure that your speech input is sampled at 16kHz and your text is converted into a sequence of phonemes.
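A small pre-flight check along these lines can catch a mismatched sampling rate before audio reaches the model. This is only a sketch: the helper names and the 3-second, 48 kHz clip are hypothetical, and the actual resampling would be done by an audio library, not by this code.

```python
TARGET_SAMPLE_RATE = 16_000  # Hz, the input rate this model card asks for


def is_model_ready(source_rate: int) -> bool:
    """Return True if audio at source_rate can be fed to the model as-is."""
    return source_rate == TARGET_SAMPLE_RATE


def resampled_length(num_samples: int, source_rate: int,
                     target_rate: int = TARGET_SAMPLE_RATE) -> int:
    """Approximate sample count after resampling, rounded to the nearest sample."""
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sampling rates must be positive")
    return round(num_samples * target_rate / source_rate)


# Hypothetical clip: 3 seconds recorded at 48 kHz.
clip_samples = 144_000
needs_resampling = not is_model_ready(48_000)
expected_samples = resampled_length(clip_samples, 48_000)
```

The duration stays the same after resampling; only the number of samples per second changes.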

✨ Features

The model is pretrained on 16kHz sampled speech audio and phonetic labels and fine-tuned on French phonemes.
It can capture information more correlated with phonetic structures and improve generalization across languages and domains.

📚 Documentation

Paper

UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data

Authors

Chengyi Wang, Yu Wu, Yao Qian, Kenichi Kumatani, Shujie Liu, Furu Wei, Michael Zeng, Xuedong Huang

Abstract

In this paper, we propose a unified pre-training approach called UniSpeech to learn speech representations with both unlabeled and labeled data, in which supervised phonetic CTC learning and phonetically-aware contrastive self-supervised learning are conducted in a multi-task learning manner. The resultant representations can capture information more correlated with phonetic structures and improve the generalization across languages and domains. We evaluate the effectiveness of UniSpeech for cross-lingual representation learning on public CommonVoice corpus. The results show that UniSpeech outperforms self-supervised pretraining and supervised transfer learning for speech recognition by a maximum of 13.4% and 17.8% relative phone error rate reductions respectively (averaged over all testing languages). The transferability of UniSpeech is also demonstrated on a domain-shift speech recognition task, i.e., a relative word error rate reduction of 6% against the previous approach.

Original Model

The original model can be found under https://github.com/microsoft/UniSpeech/tree/main/UniSpeech.

💻 Usage Examples

Basic Usage

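import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import AutoModelForCTC, AutoProcessor

model_id = "microsoft/unispeech-1350-en-353-fr-ft-1h"
target_rate = 16_000  # the model expects 16kHz input

# Stream a single French test sample instead of downloading the whole split
sample = next(iter(load_dataset("common_voice", "fr", split="test", streaming=True)))

# Resample from the clip's own rate (48kHz in the original example) to 16kHz
source_rate = sample["audio"]["sampling_rate"]
resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), source_rate, target_rate).numpy()

model = AutoModelForCTC.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Convert the waveform into model inputs
input_values = processor(resampled_audio, sampling_rate=target_rate, return_tensors="pt").input_values

# Inference only, so gradients are not needed
with torch.no_grad():
    logits = model(input_values).logits

# Greedy decoding: most likely token per frame, then decode to a phoneme string
prediction_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(prediction_ids)
# gives -> 'œ̃ v ʁ ɛ t ʁ a v a j ɛ̃ t e ʁ ɛ s ɑ̃ v a ɑ̃ f ɛ̃ ɛ t ʁ ə m ə n e s y ʁ s ə s y ʒ ɛ'
# for 'Un vrai travail intéressant va, enfin, être mené sur ce sujet.'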
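The `argmax` and `batch_decode` steps in the example above amount to greedy CTC decoding: take the best token per frame, merge consecutive repeats, then drop the blank token. The sketch below shows that collapse rule on a toy input. The vocabulary, the frame ids, and the choice of `0` as the blank id are assumptions for illustration and are not read from this model's tokenizer.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive repeated ids, then remove the blank id."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary; id 0 plays the role of the CTC blank.
toy_vocab = {0: "<blank>", 1: "v", 2: "ʁ", 3: "ɛ"}

# Per-frame argmax ids, as they might come out of torch.argmax(logits, dim=-1).
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 3]

collapsed = ctc_greedy_collapse(frame_ids)
phonemes = " ".join(toy_vocab[token_id] for token_id in collapsed)
```

Because repeats are merged before blanks are removed, a blank between two identical ids keeps them as two separate phonemes, which is how CTC represents genuinely doubled symbols.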

📄 License

The official license can be found here

📦 Additional Information

Contribution

The model was contributed by cywang and patrickvonplaten.

Official Results

See UniSpeech-L^{+} - fr:

[Figure: official results table; image not preserved]

Property	Details
Model Type	Speech model fine-tuned on phoneme classification
Training Data	Common Voice
