🚀 LeBenchmark: wav2vec2 base model trained on 2.6K hours of French speech
LeBenchmark provides a set of wav2vec2 models pretrained on various French datasets covering spontaneous, read, and broadcast speech. It comes in two versions; the later one (LeBenchmark 2.0) extends the first in both the number of pretrained SSL models and the number of downstream tasks. For details on how the wav2vec2 models are evaluated, refer to our paper: LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech
✨ Features
Model and data descriptions
The models listed below are available under our HuggingFace organization. Four wav2vec2 architectures (Light, Base, Large, and xLarge) are paired with our small (1K), medium (3K), large (7K), and extra-large (14K) corpora.
LeBenchmark 2.0
- [wav2vec2-FR-14K-xlarge](https://huggingface.co/LeBenchmark/wav2vec2-FR-14K-xlarge): xLarge wav2vec2 trained on 14K hours of French speech (5.4K Males / 2.4K Females / 6.8K unknown).
- [wav2vec2-FR-14K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-14K-large): Large wav2vec2 trained on 14K hours of French speech (5.4K Males / 2.4K Females / 6.8K unknown).
- [wav2vec2-FR-14K-light](https://huggingface.co/LeBenchmark/wav2vec2-FR-14K-light): Light wav2vec2 trained on 14K hours of French speech (5.4K Males / 2.4K Females / 6.8K unknown).
LeBenchmark
- [wav2vec2-FR-7K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-large): Large wav2vec2 trained on 7.6K hours of French speech (1.8K Males / 1.0K Females / 4.8K unknown).
- [wav2vec2-FR-7K-base](https://huggingface.co/LeBenchmark/wav2vec2-FR-7K-base): Base wav2vec2 trained on 7.6K hours of French speech (1.8K Males / 1.0K Females / 4.8K unknown).
- [wav2vec2-FR-3K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-3K-large): Large wav2vec2 trained on 2.9K hours of French speech (1.8K Males / 1.0K Females / 0.1K unknown).
- [wav2vec2-FR-3K-base](https://huggingface.co/LeBenchmark/wav2vec2-FR-3K-base): Base wav2vec2 trained on 2.9K hours of French speech (1.8K Males / 1.0K Females / 0.1K unknown).
- [wav2vec2-FR-2.6K-base](https://huggingface.co/LeBenchmark/wav2vec2-FR-2.6K-base): Base wav2vec2 trained on 2.6K hours of French speech (no spontaneous speech).
- [wav2vec2-FR-1K-large](https://huggingface.co/LeBenchmark/wav2vec2-FR-1K-large): Large wav2vec2 trained on 1K hours of French speech (0.5K Males / 0.5K Females).
- [wav2vec2-FR-1K-base](https://huggingface.co/LeBenchmark/wav2vec2-FR-1K-base): Base wav2vec2 trained on 1K hours of French speech (0.5K Males / 0.5K Females).
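
The naming scheme in the list above is regular enough to build a repository identifier from a corpus size and an architecture. The sketch below is illustrative only: `RELEASED`, `repo_id`, and `load_encoder` are hypothetical names, and `load_encoder` assumes that the Hub repository hosts a checkpoint readable by the `transformers` library, which should be checked against the repository files before relying on it.

```python
# Corpus/architecture pairs copied from the model list above.
RELEASED = {
    "14K": ("xlarge", "large", "light"),
    "7K": ("large", "base"),
    "3K": ("large", "base"),
    "2.6K": ("base",),
    "1K": ("large", "base"),
}


def repo_id(corpus: str, arch: str) -> str:
    """Return the Hub repository id for a released corpus/architecture pair."""
    arch = arch.lower()
    if arch not in RELEASED.get(corpus, ()):
        raise ValueError(f"no released checkpoint for corpus={corpus!r}, arch={arch!r}")
    return f"LeBenchmark/wav2vec2-FR-{corpus}-{arch}"


def load_encoder(corpus: str, arch: str):
    """Download and return the encoder.

    Assumption: the repository contains a transformers-compatible checkpoint.
    The import is lazy so that repo_id() works without transformers installed.
    """
    from transformers import Wav2Vec2Model

    return Wav2Vec2Model.from_pretrained(repo_id(corpus, arch))


print(repo_id("2.6K", "base"))  # LeBenchmark/wav2vec2-FR-2.6K-base
```

Asking for a pair that was not released, such as `repo_id("2.6K", "large")`, raises a `ValueError` instead of producing a URL that does not exist.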
Intended uses & limitations
Pretrained wav2vec2 models are distributed under the Apache-2.0 license, allowing extensive reuse without strict limitations. However, benchmarks and data may be associated with corpora that are not fully open-sourced.
Fine-tune with Fairseq for ASR with CTC
Since our wav2vec2 models were trained with Fairseq, they can be used with the tools Fairseq provides to fine-tune the model for ASR with CTC. The full process is summarized in [this blogpost](https://huggingface.co/blog/fine-tune-wav2vec2-english).
Note that due to the nature of CTC, speech-to-text results are not expected to be state-of-the-art. Future features may emerge depending on the involvement of Fairseq and HuggingFace.
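
For readers unfamiliar with CTC, the decoding step it implies can be sketched in a few lines. This is a minimal greedy decoder written for illustration, not Fairseq's implementation: the function name `ctc_greedy_decode`, the toy vocabulary, and the frame labels are all made up, and index 0 is assumed to be the blank symbol.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return [label for label in collapsed if label != blank]


# Hypothetical vocabulary and per-frame argmax labels, for illustration only.
vocab = {0: "<blank>", 1: "b", 2: "o", 3: "n"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 3, 3, 0, 0]

print("".join(vocab[i] for i in ctc_greedy_decode(frames)))  # bon
```

Each frame is labeled independently and no language model is involved in this scheme, which is one reason a plain CTC system tends to trail pipelines that add attention or language-model rescoring.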
Integrate with SpeechBrain for ASR, Speaker, Source Separation ...
Pretrained wav2vec models have recently gained popularity. Meanwhile, the SpeechBrain toolkit offers a new and simpler approach to state-of-the-art speech and deep-learning technologies.
Although it's currently in beta, SpeechBrain provides two ways to integrate wav2vec2 models trained with Fairseq (our LeBenchmark models):
- Extract wav2vec2 features on-the-fly (with a frozen wav2vec2 encoder) to combine with any speech-related architecture, such as E2E ASR with CTC+Att+Language Models, Speaker Recognition or Verification, Source Separation, etc.
- Experimental: To fully utilize wav2vec2, fine-tuning the model during downstream task training is the best option. This is easily achievable in SpeechBrain by turning on a flag, so our wav2vec2 models can be fine-tuned while training your preferred ASR pipeline or Speaker Recognizer.

If interested, follow this [tutorial](https://colab.research.google.com/drive/17Hu1pxqhfMisjkSgmM2CnZxfqDyn2hSY?usp=sharing).
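
When features are extracted on-the-fly, the downstream architecture needs to know how many feature frames to expect for a given waveform. The sketch below estimates that number under two assumptions that are not stated in this card: the encoder uses the standard wav2vec2 convolutional front end (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2), and the input audio is sampled at 16 kHz. The name `num_frames` is hypothetical; check the configuration of the specific checkpoint before relying on these values.

```python
# Assumed standard wav2vec2 feature-encoder layout (not taken from this card).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of feature frames produced for a waveform of num_samples samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        if length < kernel:
            return 0  # input too short to pass through this layer
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio (assumed sampling rate).
print(num_frames(16000))  # 49
```

Under these assumptions the overall stride is 320 samples, so the encoder emits roughly one frame every 20 ms.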
📄 License
Pretrained wav2vec2 models are distributed under the Apache-2.0 license.
📚 Documentation
Referencing LeBenchmark
@misc{parcollet2023lebenchmark,
title={LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech},
author={Titouan Parcollet and Ha Nguyen and Solene Evain and Marcely Zanon Boito and Adrien Pupier and Salima Mdhaffar and Hang Le and Sina Alisamir and Natalia Tomashenko and Marco Dinarelli and Shucong Zhang and Alexandre Allauzen and Maximin Coavoux and Yannick Esteve and Mickael Rouvier and Jerome Goulian and Benjamin Lecouteux and Francois Portet and Solange Rossato and Fabien Ringeval and Didier Schwab and Laurent Besacier},
year={2023},
eprint={2309.05472},
archivePrefix={arXiv},
primaryClass={cs.CL}
}