wav2vec2-large-xlsr-53-french_punctuation Open-source Model - French Speech Recognition Supporting Punctuation Prediction

Wav2vec2 Large Xlsr 53 French Punctuation

Developed by Ilyes

A French automatic speech recognition model based on the wav2vec2-large-xlsr-53 architecture, supporting punctuation prediction

Speech Recognition | French | Open Source License: Apache-2.0 | #French speech recognition #Automatic punctuation generation #XLSR fine-tuning

Downloads 23

Release Time: 3/2/2022

Model Overview

This model is a fine-tuned version of wav2vec2-large-xlsr-53 for French speech recognition. It transcribes French audio and predicts punctuation as part of the transcript.

Model Features

Punctuation prediction

Can automatically predict and add punctuation marks to improve the readability of the transcribed text

High accuracy

Reports a WER of 19.47% and a CER of 6.66% for text with punctuation on the Common Voice French test set

XLSR fine-tuning

Fine-tuned from the cross-lingual speech representation (XLSR) pre-trained model, which provides the speech feature extractor
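The WER and CER figures quoted above are edit-distance ratios. As a minimal sketch, assuming the standard definitions (Levenshtein distance over words or characters, divided by the reference length), they can be computed as follows. The function names are illustrative, and this is not necessarily the scoring code used by the model author.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One reference/prediction pair taken from the results table of this model card.
ref = "il vécut à new york et y enseigna une grande partie de sa vie."
hyp = "il a vécu à new york et y enseigna une grande partie de sa vie."
print(f"WER = {wer(ref, hyp):.4f}, CER = {cer(ref, hyp):.4f}")
```

Because punctuation marks are attached to words, a wrong or missing mark counts as a word error in this scheme, which is one reason the scores with punctuation are higher than those without.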

Model Capabilities

French speech recognition

Automatic punctuation prediction

Speech-to-text

Use Cases

Speech transcription

Meeting minutes

Automatically transcribe French meeting recordings and add punctuation marks

Improve transcription efficiency and text readability

Media subtitle generation

Generate subtitles with punctuation for French video content

Save time on manual subtitle production

Voice assistant

French voice input

Support French voice command recognition and processing

Enhance the voice interaction experience

🚀 Wav2Vec2 Large XLSR 53 French Punctuation Model

This model is fine-tuned for French speech recognition, capable of predicting text and punctuation, and has been evaluated on the Common Voice French dataset.

🚀 Quick Start

The following steps and code examples demonstrate how to evaluate this model on the Common Voice French test dataset.

💻 Usage Examples

Basic Usage

import re
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

model_name = "Ilyes/wav2vec2-large-xlsr-53-french_punctuation"

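# Fall back to CPU when no GPU is available; `device` is reused in map_to_pred.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2Processor.from_pretrained(model_name)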

ds = load_dataset("common_voice", "fr", split="test")

chars_to_ignore_regex = '[\;\:\"\“\%\‘\”\�\‘\’\’\’\‘\…\·\ǃ\«\‹\»\›“\”\\ʿ\ʾ\„\∞\\|\;\:\*\—\–\─\―\_\/\:\ː\;\=\«\»\→]'
def normalize_text(text):
    text = text.lower().strip()
    text = re.sub('œ', 'oe', text)
    text = re.sub('æ', 'ae', text)
    text = re.sub("’|´|′|ʼ|‘|ʻ|`", "'", text)
    text = re.sub("'+ ", " ", text)
    text = re.sub(" '+", " ", text)
    text = re.sub("'$", " ", text)
    text = re.sub("' ", " ", text)
    text = re.sub("−|‐", "-", text)
    text = re.sub(" -", "", text)
    text = re.sub("- ", "", text)
    text = re.sub(chars_to_ignore_regex, '', text)
    return text

def map_to_array(batch):
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = resampler.forward(speech.squeeze(0)).numpy()
    batch["sampling_rate"] = resampler.new_freq
    batch["sentence"] = normalize_text(batch["sentence"])
    return batch

# The resampler must exist before map_to_array runs (48 kHz clips -> 16 kHz model input).
resampler = torchaudio.transforms.Resample(48_000, 16_000)
ds = ds.map(map_to_array)

def map_to_pred(batch):
    features = processor(batch["speech"], sampling_rate=batch["sampling_rate"][0], padding=True, return_tensors="pt")
    input_values = features.input_values.to(device)
    attention_mask = features.attention_mask.to(device)
    with torch.no_grad():
        logits = model(input_values, attention_mask=attention_mask).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["predicted"] = processor.batch_decode(pred_ids)
    # With batched=True, batch["sentence"] is a list of strings, so process each one.
    # Collapse runs of the same punctuation mark ("..", "??", "!!", ",,") into one.
    batch["target"] = [re.sub(r'([.?!,])\1+', r'\1', s) for s in batch["sentence"]]
    return batch

result = ds.map(map_to_pred, batched=True, batch_size=16, remove_columns=list(ds.features.keys()))
wer = load_metric("wer")
print(wer.compute(predictions=result["predicted"], references=result["target"]))
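The `torch.argmax` and `processor.batch_decode` steps in the script amount to greedy CTC decoding: take the most likely token per audio frame, collapse consecutive repeats, then drop the blank token. A minimal sketch of that idea in plain Python follows. The vocabulary, the blank id of 0, and the `|` word delimiter are assumptions made for illustration and are not read from this model's tokenizer.

```python
# Hypothetical toy vocabulary; the real model ships its own vocab file.
ID_TO_TOKEN = {0: "<pad>", 1: "|", 2: "o", 3: "u", 4: "i",
               5: ",", 6: "n", 7: "l", 8: "e"}


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Turn per-frame argmax ids into text using the greedy CTC rule."""
    collapsed = []
    prev = None
    for token_id in frame_ids:
        if token_id != prev:        # step 1: merge consecutive repeats
            collapsed.append(token_id)
        prev = token_id
    # step 2: remove blanks (assumed here to be the pad token)
    tokens = [id_to_token[i] for i in collapsed if i != blank_id]
    # step 3: map the word delimiter to a space
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Frames for "oui, non": repeats are merged, blanks (0) are dropped.
frames = [2, 2, 0, 3, 4, 4, 5, 1, 6, 2, 0, 6, 6]
print(ctc_greedy_decode(frames, ID_TO_TOKEN))

# A blank between two identical ids keeps a genuine double letter ("elle").
print(ctc_greedy_decode([8, 7, 0, 7, 8], ID_TO_TOKEN))
```

In this sketch punctuation marks are ordinary vocabulary entries, so they come out of the same decoding pass as the letters without a separate punctuation model.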

📚 Documentation

Some Results

Reference	Prediction
il vécut à new york et y enseigna une grande partie de sa vie.	il a vécu à new york et y enseigna une grande partie de sa vie.
au classement par nations, l'allemagne est la tenante du titre.	au classement der nation l'allemagne est la tenante du titre.
voici un petit calcul pour fixer les idées.	voici un petit calcul pour fixer les idées.
oh! tu dois être beau avec	oh! tu dois être beau avec.
babochet vous le voulez?	baboche, vous le voulez?
la commission est, par conséquent, défavorable à cet amendement.	la commission est, par conséquent, défavorable à cet amendement.

All the references and predictions of the test corpus are already available in this repository.

Overall Results

Text + Punctuation: WER = 21.47%, CER = 7.21%
Text (without punctuation): WER = 19.71%, CER = 6.91%
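The "without punctuation" scores presumably come from removing punctuation from both references and predictions before scoring. The exact procedure is not given here, so the following is only a sketch: it assumes the punctuation set is the four marks that the evaluation script de-duplicates (`.`, `,`, `?`, `!`), and the helper name is hypothetical.

```python
import re

# Assumed punctuation set, inferred from the de-duplication step in the script.
PUNCTUATION = re.compile(r"[.,?!]")


def strip_punctuation(text):
    """Remove the assumed punctuation marks and tidy the remaining whitespace."""
    text = PUNCTUATION.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


# Two pairs from the results table: with punctuation removed, the first pair
# becomes identical, and the second differs only in "babochet" vs "baboche".
pairs = [
    ("oh! tu dois être beau avec", "oh! tu dois être beau avec."),
    ("babochet vous le voulez?", "baboche, vous le voulez?"),
]
for reference, prediction in pairs:
    print(strip_punctuation(reference), "|", strip_punctuation(prediction))
```

Apostrophes and hyphens are left untouched because they are part of French words (for example `l'allemagne`) and are not sentence punctuation.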

📄 License

This model is licensed under the Apache-2.0 license.

📦 Model Information

Property	Details
Model Type	wav2vec2-large-xlsr-53-French_punctuation
Training Data	Common Voice
Tags	audio, automatic-speech-recognition, speech, xlsr-fine-tuning
Model Name	wav2vec2-large-xlsr-53-French_punctuation by Ilyes Rebai
Speech Recognition Metrics on Common Voice (fr)	Test WER and CER on text and punctuation prediction: [19.47%, 6.66%]; Test WER and CER on text without punctuation: [17.88%, 6.37%]
