wav2vec2-french-phonemizer: an open-source model for accurate conversion of French speech into phonemes

Wav2vec2 French Phonemizer

Developed by Cnam-LMSSC

This is a model fine-tuned for the task of French speech to phoneme conversion, based on the facebook/wav2vec2-base-fr-voxpopuli-v2 model and trained using the Common Voice v13 dataset.

Speech Recognition

Transformers

Language: French
License: MIT (open source)
Tags: French speech to phoneme conversion, High-precision phonetic transcription, 16 kHz audio processing

Release Time : 11/8/2023

Model Overview

This model can convert French speech into a phoneme sequence encoded in the International Phonetic Alphabet (IPA), providing support for speech processing-related tasks.

Model Features

Task-specific fine-tuning

Optimized specifically for the task of French speech to phoneme conversion, improving performance on this task

Multi-dataset validation

Performs well on multiple datasets such as Common Voice v13 and Multilingual Librispeech

High-quality phonetic output

The output is encoded in the International Phonetic Alphabet (IPA) and can be directly used for downstream tasks such as speech synthesis

Model Capabilities

French speech recognition

Phoneme conversion

Speech processing

Use Cases

Speech processing

Speech to phoneme conversion

Convert French speech into a phoneme sequence

The phoneme error rate (PER) is 5.52% on Common Voice v13 and 4.36% on Multilingual Librispeech

🚀 Fine-tuned French Voxpopuli v2 wav2vec2-base model for speech-to-phoneme task in French

This model is a fine-tuned version of facebook/wav2vec2-base-fr-voxpopuli-v2 for the French speech-to-phoneme task (without a language model). It is trained using the train and validation splits of Common Voice v13.

🚀 Quick Start

Audio Sample Rate

When using this model, make sure that your speech input is sampled at 16 kHz.
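The 16 kHz requirement can be checked before running inference. The sketch below is only an illustration: it uses Python's standard `wave` module, so it handles uncompressed WAV files only (not FLAC), and the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, not part of the model or of any library.

```python
import wave

EXPECTED_RATE = 16_000  # sample rate (Hz) requested by the model card


def wav_sample_rate(path):
    """Return the sample rate (Hz) stored in the header of a WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True when the file is not sampled at the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If `needs_resampling("example.wav")` returns `True`, the audio should be resampled to 16 kHz with an audio library of your choice before it is passed to the model.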

Output

As this model is specifically trained for a speech-to-phoneme task, the output is a sequence of IPA-encoded words, without punctuation. If you are not fluent in reading the phonetic alphabet, you can use this excellent IPA reader website to convert the transcript back into synthetic speech and check the quality of the phonetic transcription.

✨ Features

Datasets: Utilizes the mozilla-foundation/common_voice_13_0 dataset.
Metrics: Evaluated using the Phoneme Error Rate (PER).
Model Index:
- Name: Wav2Vec2-base French finetuned for phonemes by LMSSC
- Results:
  - Task: Speech Recognition
  - Dataset: Common Voice v13 (French)
  - Metrics:
    - Test PER on Common Voice FR 13.0 | Trained: 5.52
    - Test PER on Multilingual Librispeech FR | Trained: 4.36
    - Val PER on Common Voice FR 13.0 | Trained: 4.31

💻 Usage Examples

Basic Usage

The model can be used directly with the HuggingSound library:

import pandas as pd
from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("Cnam-LMSSC/wav2vec2-french-phonemizer")
audio_paths = ["./test_relecture_texte.wav", "./10179_11051_000021.flac"]

# The audio files do not need to be sampled at 16 kHz here:
# HuggingSound resamples them automatically.

transcriptions = model.transcribe(audio_paths)

# (Optional) Display the results in a table.
# `transcriptions` is a list of dicts that also contain timestamps and probabilities.

df = pd.DataFrame(transcriptions)
df["Audio file"] = audio_paths
df = df.set_index("Audio file")
df[["transcription"]]

Output:

Audio file	Phonetic transcription (IPA)
./test_relecture_texte.wav	ʃapitʁ di də abɛse pəti kɔ̃t də ʒyl ləmɛtʁ ɑ̃ʁʒistʁe puʁ libʁivɔksɔʁɡ ibis dɑ̃ la bas kuʁ dœ̃ ʃato sə tʁuva paʁmi tut sɔʁt də volaj œ̃n ibis ʁɔz
./10179_11051_000021.flac	kɛl dɔmaʒ kə sə nə swa pa dy sykʁ supiʁa se foʁaz ɑ̃ pasɑ̃ sa lɑ̃ɡ syʁ la vitʁ fɛ̃ dy ʃapitʁ kɛ̃z ɑ̃ʁʒistʁe paʁ sonjɛ̃ sɛt ɑ̃ʁʒistʁəmɑ̃ fɛ paʁti dy domɛn pyblik

Advanced Usage

If you do not want to use the huggingsound library, you can use the following inference script:

import torch
import soundfile as sf  # or librosa, if you prefer
from transformers import AutoModelForCTC, Wav2Vec2Processor

MODEL_ID = "Cnam-LMSSC/wav2vec2-french-phonemizer"

model = AutoModelForCTC.from_pretrained(MODEL_ID)
processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)

# sf.read returns the samples and the sample rate of the file.
# Make sure the audio file is sampled at 16 kHz, or resample it first!
audio, sample_rate = sf.read("example.wav")

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")

# Inference only: no gradients are needed.
with torch.no_grad():
    logits = model(**inputs).logits

# Greedy decoding: keep the most likely token for each frame.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)

print("Phonetic transcription: ", transcription)

Output: 'ʒə syi tʁɛ kɔ̃tɑ̃ də vu pʁezɑ̃te notʁ solysjɔ̃ puʁ fonomize dez odjo fasilmɑ̃ sa fɔ̃ksjɔn kɑ̃ mɛm tʁɛ bjɛ̃'
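To give an intuition for the `argmax` and `batch_decode` steps of the script above, here is a minimal sketch of greedy CTC decoding: repeated frame predictions are merged, then blank tokens are dropped. The vocabulary, the blank id and the frame ids below are made up for illustration and do not correspond to the model's real tokenizer, which handles further details.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, blank_id=0):
    """Merge consecutive duplicate ids, then remove the blank id."""
    collapsed = [token_id for token_id, _ in groupby(ids)]
    return [token_id for token_id in collapsed if token_id != blank_id]


# Hypothetical toy vocabulary (NOT the model's real one); id 0 plays the blank.
TOY_VOCAB = {0: "<pad>", 1: "b", 2: "ɔ̃", 3: "ʒ", 4: "u", 5: "ʁ"}

# One predicted id per audio frame, as produced by an argmax over the logits.
frame_ids = [0, 1, 1, 0, 2, 2, 2, 0, 0, 3, 4, 4, 0, 5, 5]

token_ids = ctc_greedy_collapse(frame_ids)
phonemes = "".join(TOY_VOCAB[token_id] for token_id in token_ids)
print(phonemes)
```

Note that a blank between two identical ids keeps them separate, which is how CTC can output the same phoneme twice in a row.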

📚 Documentation

Training Procedure

The model was fine-tuned on Common Voice v13 (FR) for 14 epochs on 4 × 2080 Ti GPUs at Cnam/LMSSC, using a DDP (distributed data parallel) strategy and gradient accumulation: 256 audio clips per update, which corresponds to roughly 25 minutes of speech per update, i.e. about 2k updates per epoch.

Learning rate schedule: Double Tri-state schedule
- Warmup from 1e-5 for 7% of total updates
- Constant at 1e-4 for 28% of total updates
- Linear decrease to 1e-6 for 36% of total updates
- Second warmup boost to 3e-5 for 3% of total updates
- Constant at 3e-5 for 12% of total updates
- Linear decrease to 1e-7 for remaining 14% of updates

The hyperparameters used for training are the same as those detailed in Annex B and Table 6 of the wav2vec2 paper.
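The learning rate schedule listed above can be sketched as a piecewise-linear function of training progress. This is an illustration under stated assumptions, not the authors' training code: the warmups are assumed to be linear, the first warmup is assumed to end at the 1e-4 plateau, the second warmup is assumed to start from 1e-6, and the total of 28,000 updates is only the rough product of 14 epochs and about 2k updates per epoch.

```python
# (fraction of total updates, starting LR, ending LR) for each phase, in order.
# ASSUMPTIONS: linear ramps; first warmup ends at 1e-4; second starts at 1e-6.
PHASES = [
    (0.07, 1e-5, 1e-4),  # warmup
    (0.28, 1e-4, 1e-4),  # constant
    (0.36, 1e-4, 1e-6),  # linear decrease
    (0.03, 1e-6, 3e-5),  # second warmup boost
    (0.12, 3e-5, 3e-5),  # constant
    (0.14, 3e-5, 1e-7),  # final linear decrease
]

TOTAL_UPDATES = 14 * 2_000  # approximate: 14 epochs x ~2k updates per epoch


def learning_rate(step, total_steps=TOTAL_UPDATES):
    """Return the learning rate at update `step` out of `total_steps`."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    phase_start = 0.0
    for fraction, lr_start, lr_end in PHASES:
        phase_end = phase_start + fraction
        if progress <= phase_end:
            within = (progress - phase_start) / fraction
            return lr_start + within * (lr_end - lr_start)
        phase_start = phase_end
    # Reached only if floating-point rounding leaves the last boundary below 1.0.
    return PHASES[-1][2]
```

Calling `learning_rate(step)` for every update from 0 to `TOTAL_UPDATES` traces the two "warmup, plateau, decay" cycles that give the schedule its name.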

🔧 Technical Details

The model is a fine-tuned version of facebook/wav2vec2-base-fr-voxpopuli-v2 for the French speech-to-phoneme task. It is trained on the Common Voice v13 dataset.

📄 License

This model is released under the MIT license.

📈 Test Results

In the table below, we report the Phoneme Error Rate (PER) of the model on both Common Voice and Multilingual Librispeech (using the French configuration of each dataset), when fine-tuned on the Common Voice train set only:

Model	Test Set	PER
Cnam-LMSSC/wav2vec2-french-phonemizer	Common Voice v13 (French)	5.52%
Cnam-LMSSC/wav2vec2-french-phonemizer	Multilingual Librispeech (French)	4.36%
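
For readers unfamiliar with the metric, PER is commonly defined as the edit distance (substitutions, insertions and deletions) between the predicted and reference phoneme sequences, divided by the length of the reference. The sketch below implements that generic definition; how the reported scores tokenize IPA symbols (for instance combining diacritics or spaces) is not stated here, so the function simply takes lists of phoneme symbols, and its names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of symbols."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference, hypothesis):
    """Edit distance divided by the number of reference phonemes."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example: one substitution and one deletion over five reference phonemes.
reference = ["b", "ɔ̃", "ʒ", "u", "ʁ"]
hypothesis = ["b", "ɔ", "ʒ", "u"]
print(f"PER = {phoneme_error_rate(reference, hypothesis):.0%}")
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as the table above.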

📝 Citation

If you use this finetuned model for any publication, please use the following to cite our work:

@misc {lmssc-wav2vec2-base-phonemizer-french_2023,
	author       = { Olivier, Malo AND Hauret, Julien AND Bavu, {É}ric },
	title        = { wav2vec2-french-phonemizer (Revision e715906) },
	year         = 2023,
	url          = { https://huggingface.co/Cnam-LMSSC/wav2vec2-french-phonemizer },
	doi          = { 10.57967/hf/1339 },
	publisher    = { Hugging Face }
}
