wav2vec2-large-xlsr-53-french Open Source Model - High-accuracy French Speech-to-Text Support

Wav2vec2 Large Xlsr 53 French

Developed by jonatasgrosman

This is a French speech recognition model fine-tuned from the XLSR-53 large model, trained on the Common Voice dataset, supporting high-accuracy French speech-to-text conversion.

Speech Recognition | French
Open Source License: Apache-2.0
#French speech recognition #Low word error rate #XLSR-53 fine-tuning

Downloads 47.83k

Release Time: 3/2/2022

Model Overview

This model is an automatic speech recognition (ASR) system for French, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint. It converts French speech into text.

Model Features

High-precision French recognition

Achieves a word error rate (WER) of 17.65% and a character error rate (CER) of 4.89% on the Common Voice French test set.

Language model enhancement support

When combined with a language model, WER can be reduced to 13.59% and CER to 3.91%, significantly improving recognition accuracy.

16kHz sampling rate support

Expects speech input sampled at 16kHz. Audio recorded at another rate should be resampled to 16kHz before transcription.

Open-source license

Licensed under Apache-2.0, allowing for commercial and research use.

Model Capabilities

French speech recognition

Real-time speech-to-text

Batch audio processing

Use Cases

Speech transcription

French speech-to-text

Convert French speech content into editable text format

Achieves a WER of 17.65% on the Common Voice French test set (13.59% with a language model).

Voice assistants

French voice command recognition

Used for voice command recognition in French voice assistants or control systems

🚀 Fine-tuned XLSR-53 large model for speech recognition in French

This fine-tuned model is based on facebook/wav2vec2-large-xlsr-53 and trained on French data from Common Voice 6.1. It's designed for accurate French speech recognition.

🚀 Quick Start

This model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on French, using the train and validation splits of Common Voice 6.1. When using this model, ensure that your speech input is sampled at 16kHz.
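A quick way to catch a mismatched sampling rate before transcription is to read it from the file header. The sketch below uses only Python's standard `wave` module, so it covers uncompressed PCM WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and are not part of the model or of any library mentioned here.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the note above


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True when the file's rate differs from the expected rate."""
    return wav_sample_rate(path) != target_rate
```

If a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the advanced usage script does, resamples it on load.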

This model was fine-tuned thanks to GPU credits generously provided by OVHcloud :)

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

✨ Features

Audio Processing: Specialized for French speech recognition.
Fine-tuned Model: Based on a pre-trained large model and fine-tuned on French data.
Multiple Metrics: Evaluated using WER (Word Error Rate) and CER (Character Error Rate).
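For readers unfamiliar with the two metrics, both are edit distances divided by the reference length: WER counts word-level substitutions, insertions, and deletions, and CER does the same at character level. The following is a minimal per-sentence sketch in plain Python. It is not the evaluation script used for this model, and reported scores are normally aggregated over a whole test set rather than computed per sentence.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or strings)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference item
                current[j - 1] + 1,      # insertion of a hypothesis item
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate for one sentence pair (reference must be non-empty)."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate for one sentence pair (reference must be non-empty)."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, scoring the reference "LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS" against the prediction "LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS" gives two word substitutions out of seven reference words, so `wer` returns 2/7.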

📦 Installation

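The model card lists no installation steps. The usage examples below import huggingsound (basic usage) and torch, librosa, datasets, and transformers (advanced usage), so those packages need to be available in your Python environment.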

💻 Usage Examples

Basic Usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-french")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced Usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "fr"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-french"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)
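The `torch.argmax` plus `batch_decode` step in the script above amounts to greedy CTC decoding: take the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The toy sketch below illustrates only that collapsing rule. The vocabulary and the `blank_id` value are made up for illustration, and the real tokenizer also maps a word-delimiter token to spaces, which this sketch omits.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_char, blank_id=0):
    """Merge consecutive repeated ids, then remove blanks and map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_char[t] for t in collapsed if t != blank_id)


# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
toy_vocab = {0: "<blank>", 1: "H", 2: "U", 3: "I", 4: "T"}

# One predicted id per audio frame, as torch.argmax would produce.
frames = [0, 1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 4]
print(ctc_greedy_collapse(frames, toy_vocab))
```

A blank between two identical ids keeps them as separate characters, which is how CTC represents doubled letters.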

Usage Results

Reference	Prediction
"CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE."	CE DERNIER ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE
CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ACHÉMÉNIDE ET SEPT DES SASSANIDES.	CE SITE CONTIENT QUATRE TOMBEAUX DE LA DYNASTIE ASHEMÉNID ET SEPT DES SASANDNIDES
"J'AI DIT QUE LES ACTEURS DE BOIS AVAIENT, SELON MOI, BEAUCOUP D'AVANTAGES SUR LES AUTRES."	JAI DIT QUE LES ACTEURS DE BOIS AVAIENT SELON MOI BEAUCOUP DAVANTAGES SUR LES AUTRES
LES PAYS-BAS ONT REMPORTÉ TOUTES LES ÉDITIONS.	LE PAYS-BAS ON REMPORTÉ TOUTES LES ÉDITIONS
IL Y A MAINTENANT UNE GARE ROUTIÈRE.	IL AMNARDIGAD LE TIRAN
HUIT	HUIT
DANS L’ATTENTE DU LENDEMAIN, ILS NE POUVAIENT SE DÉFENDRE D’UNE VIVE ÉMOTION	DANS L'ATTENTE DU LENDEMAIN IL NE POUVAIT SE DÉFENDRE DUNE VIVE ÉMOTION
LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES.	LA PREMIÈRE SAISON EST COMPOSÉE DE DOUZE ÉPISODES
ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES.	ELLE SE TROUVE ÉGALEMENT DANS LES ÎLES BRITANNIQUES
ZÉRO	ZEGO
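Note that the references in this table keep their punctuation while the predictions contain none, so a fair comparison needs the references normalized first. The sketch below shows one possible normalization. The punctuation set is an assumption chosen to match the rows above (apostrophes and hyphens are kept because the predictions keep them); it is not the author's evaluation code.

```python
import re

# Assumed set of characters to strip; apostrophes and hyphens are kept.
_PUNCTUATION = re.compile(r'[",.?!;:]')


def normalize(text):
    """Uppercase, unify apostrophes, strip punctuation, collapse whitespace."""
    text = text.replace("\u2019", "'")  # typographic apostrophe -> ASCII
    text = _PUNCTUATION.sub("", text)
    return " ".join(text.upper().split())


reference = '"CE DERNIER A ÉVOLUÉ TOUT AU LONG DE L\'HISTOIRE ROMAINE."'
prediction = "CE DERNIER ÉVOLUÉ TOUT AU LONG DE L'HISTOIRE ROMAINE"
print(normalize(reference) == prediction)
```

After normalization, the remaining differences are genuine recognition errors, such as the dropped word "A" in the first row.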

📚 Documentation

Evaluation

To evaluate on the test split of mozilla-foundation/common_voice_6_0:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-french --dataset mozilla-foundation/common_voice_6_0 --config fr --split test

To evaluate on the validation split of speech-recognition-community-v2/dev_data:

python eval.py --model_id jonatasgrosman/wav2vec2-large-xlsr-53-french --dataset speech-recognition-community-v2/dev_data --config fr --split validation --chunk_length_s 5.0 --stride_length_s 1.0
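The `--chunk_length_s` and `--stride_length_s` flags suggest that long recordings are split into overlapping windows before transcription. The sketch below illustrates that windowing, assuming the stride applies to both sides of each chunk, so consecutive chunks advance by the chunk length minus twice the stride. The function `chunk_windows` is hypothetical and simplified; it is not the code inside `eval.py`, and a real implementation may handle edges differently.

```python
def chunk_windows(num_samples, chunk_length_s, stride_length_s, sampling_rate=16_000):
    """Return (start, end) sample indices of overlapping chunks over the audio."""
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    step = chunk_len - 2 * stride  # assumed: stride on the left and on the right
    if step <= 0:
        raise ValueError("chunk length must exceed twice the stride length")

    windows = []
    for start in range(0, num_samples, step):
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end == num_samples:  # the last chunk reached the end of the audio
            break
    return windows


# A 12-second clip at 16 kHz with the flag values from the command above.
for start, end in chunk_windows(12 * 16_000, chunk_length_s=5.0, stride_length_s=1.0):
    print(f"{start / 16_000:.1f}s to {end / 16_000:.1f}s")
```

The overlap gives the model context on both sides of each chunk boundary, which tends to reduce errors on words cut in half by a hard split.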

Model Index

Name: XLSR Wav2Vec2 French by Jonatas Grosman
Results:
- Task: Automatic Speech Recognition
- Datasets:
  - Common Voice fr:
    - Metrics:
      - Test WER: 17.65
      - Test CER: 4.89
      - Test WER (+LM): 13.59
      - Test CER (+LM): 3.91
  - Robust Speech Event - Dev Data:
    - Metrics:
      - Dev WER: 34.35
      - Dev CER: 14.09
      - Dev WER (+LM): 24.72
      - Dev CER (+LM): 12.33
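To put the language-model gains in perspective, the relative error reduction can be computed directly from the figures above. This is plain arithmetic on the reported numbers; the function name `relative_reduction` is only for illustration.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# (without LM, with LM) pairs copied from the model index above.
results = {
    "Common Voice WER": (17.65, 13.59),
    "Common Voice CER": (4.89, 3.91),
    "Dev data WER": (34.35, 24.72),
    "Dev data CER": (14.09, 12.33),
}

for name, (without_lm, with_lm) in results.items():
    print(f"{name}: {relative_reduction(without_lm, with_lm):.1%} relative reduction")
```

The language model removes roughly 23% of the word errors on Common Voice and roughly 28% on the Robust Speech Event dev data.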

Other Information

Property	Details
Model Type	Fine-tuned XLSR-53 large model for French speech recognition
Training Data	Common Voice 6.1 (train and validation splits)
Metrics	WER, CER
Tags	audio, automatic-speech-recognition, fr, hf-asr-leaderboard, mozilla-foundation/common_voice_6_0, robust-speech-event, speech, xlsr-fine-tuning-week

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you want to cite this model you can use this:

@misc{grosman2021xlsr53-large-french,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {F}rench},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-french}},
  year={2021}
}
