Open-source Speech Recognition Model wav2vec2-large-xlsr-53-breton

Wav2vec2 Large Xlsr 53 Breton

Developed by mrm8488

A Breton fine-tuned speech recognition model based on facebook/wav2vec2-large-xlsr-53

Speech Recognition | Other | Open Source | License: Apache-2.0 | #Breton speech recognition #Low-resource language processing #XLSR fine-tuning

Downloads 26

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model for Breton, fine-tuned from Facebook's wav2vec2-large-xlsr-53 checkpoint on the Breton portion of the Common Voice dataset.

Model Features

Multilingual pre-training foundation

Fine-tuned from Facebook's multilingual XLSR-53 model with strong cross-lingual learning capabilities

Breton optimization

Fine-tuned specifically on Breton speech from Common Voice

16kHz sampling rate support

Expects audio input sampled at 16 kHz; audio recorded at other rates must be resampled first

Model Capabilities

Speech recognition

Breton speech-to-text

Automatic speech transcription

Use Cases

Speech transcription

Breton speech transcription

Convert Breton speech content to text

Achieved 46.49% WER on Common Voice test set

Voice assistants

Breton voice command recognition

For Breton voice assistants or device control

🚀 Wav2Vec2-Large-XLSR-53-breton

This project fine-tunes the facebook/wav2vec2-large-xlsr-53 model on Breton using the Common Voice dataset, with the aim of providing a usable model for Breton automatic speech recognition.

🚀 Quick Start

When using this model, make sure that your speech input is sampled at 16 kHz.
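As a minimal sketch of how that requirement could be checked before transcription, the helper below reads the sample rate from a WAV header using only the Python standard library. The function names and the `EXPECTED_SAMPLE_RATE` constant are illustrative assumptions, not part of the model's API, and the standard `wave` module only handles PCM WAV files, so other formats (such as MP3) would need a different loader.

```python
import wave

# Rate the model expects, per the note above.
EXPECTED_SAMPLE_RATE = 16_000


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (in Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str) -> bool:
    """Return True when the file is not already sampled at 16 kHz."""
    return wav_sample_rate(path) != EXPECTED_SAMPLE_RATE
```

A file for which `needs_resampling` returns `True` should be resampled to 16 kHz before it is passed to the processor.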

✨ Features

Language: Breton
Datasets: Fine-tuned on the Common Voice dataset.
Task: Automatic speech recognition.

Property	Details
Model Type	XLSR Wav2Vec2 Breton Manuel Romero
Training Data	Common Voice `train` and `validation` splits


💻 Usage Examples

Basic Usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "br", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Read each audio file as an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
  logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
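The `torch.argmax` and `batch_decode` steps above amount to greedy CTC decoding: pick the most likely token per audio frame, merge consecutive repeats, then drop the blank token. The following is a minimal pure-Python sketch of that collapse step, not the library's implementation. The toy vocabulary and the choice of id 0 as the blank are hypothetical; the real model's vocabulary differs.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove the blank id."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Hypothetical toy vocabulary for illustration only.
TOY_VOCAB = {0: "<pad>", 1: "d", 2: "e", 3: "m", 4: "a", 5: "t"}

# One predicted id per audio frame, as torch.argmax would produce.
frame_ids = [1, 1, 0, 2, 2, 2, 3, 0, 0, 4, 5, 5]
text = "".join(TOY_VOCAB[i] for i in ctc_greedy_collapse(frame_ids))
print(text)  # demat
```

The blank token is what allows genuinely doubled letters to survive: the frame sequence `[4, 0, 4]` collapses to two separate tokens, whereas `[4, 4]` collapses to one.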

📚 Documentation

Evaluation

The model can be evaluated on the Breton test data of Common Voice as follows:

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "br", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model = Wav2Vec2ForCTC.from_pretrained("mrm8488/wav2vec2-large-xlsr-53-breton")
model.to("cuda")

chars_to_ignore_regex = '[\\,\\?\\.\\!\\-\\;\\:\\\"\\“\\%\\‘\\”\\�]'
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset.
# Strip punctuation and lowercase the reference text, then read each audio
# file as an array and resample it from 48 kHz to 16 kHz.
def speech_file_to_array_fn(batch):
  batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
  speech_array, sampling_rate = torchaudio.load(batch["path"])
  batch["speech"] = resampler(speech_array).squeeze().numpy()
  return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Run batched inference on the GPU and decode the predicted token ids to text.
def evaluate(batch):
  inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

  with torch.no_grad():
    logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

  pred_ids = torch.argmax(logits, dim=-1)
  batch["pred_strings"] = processor.batch_decode(pred_ids)
  return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test Result: 46.49% WER
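For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between the reference and the prediction, divided by the number of reference words. The sketch below computes it for a single sentence pair in plain Python. It is an illustration only, not the implementation behind the `wer` metric used above, which scores a whole corpus at once; the function name and sample strings are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


# One substituted word and one missing word against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

On this scale, the reported 46.49% means that the number of word-level substitutions, deletions, and insertions on the test set is a little under half the number of reference words.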

Training

The Common Voice `train` and `validation` splits were used for training. The script used for training can be found ???

📄 License

This project is licensed under the Apache-2.0 license.
