SSA-HuBERT-base-60k Open-Source Speech Model - A Free Tool for Precise Adaptation to 21 African Languages

SSA HuBERT Base 60k

Developed by Orange

A self-supervised speech model based on the HuBERT architecture, pretrained on nearly 60,000 hours of speech covering 21 languages and variants spoken in Sub-Saharan Africa

Speech Recognition

Transformers

#African multilingual speech recognition #Self-supervised speech pre-training #Low-resource language optimization

Release Time: 6/20/2024

Model Overview

This model is pretrained with self-supervised learning and, once fine-tuned, is suitable for multilingual speech recognition in Sub-Saharan African languages. Its pretraining data includes spontaneous speech recorded in noisy environments

Model Features

African Language Optimization

Pretrained on 21 languages and variants spoken in Sub-Saharan Africa

Diverse Training Data

Includes studio recordings and street interview data, covering both controlled and noisy environments

Self-supervised Learning

Pretrained with the HuBERT self-supervised learning framework on untranscribed audio; labeled data is needed only for fine-tuning

Multilingual Support

A single model supports speech recognition for multiple African languages

Model Capabilities

Speech recognition

Multilingual processing

Noisy environment speech processing

Use Cases

Speech Transcription

African Language Speech Transcription

Converts speech in multiple African languages to text

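Average CER of 15.8 and WER of 52.3 on the 20 Sub-Saharan African languages of the FLEURS dataset with per-language fine-tuning and greedy decoding (13.8 and 47.7 with joint fine-tuning)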

Speech Assistive Technology

African Language Voice Assistant

Develops voice-controlled applications for African regions

🚀 SSA-HuBERT-base-60k: Self-Supervised Speech Model

This self-supervised speech model (SSA-HuBERT-base-60k) is based on the HuBERT Base architecture (~95M params). It addresses the challenge of multilingual speech processing in Sub-Saharan Africa by leveraging nearly 60,000 hours of speech segments, covering 21 languages and variants.

🚀 Quick Start

The checkpoint is a pretrained speech encoder and must be fine-tuned before it can transcribe speech. For ASR, the authors fine-tune it with the SpeechBrain toolkit on the FLEURS dataset, separately for each language.

✨ Features

Multilingual Coverage: Covers 21 languages and variants spoken in Sub-Saharan Africa.
Large-scale Training: Trained on nearly 60,000 hours of speech segments.
Self-supervised Learning: Based on the HuBERT Base architecture for effective speech representation learning.

📦 Installation

No specific installation steps are provided in the original document.

💻 Usage Examples

No code examples are provided in the original document.
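In the absence of an official example, the sketch below shows one way the pretrained encoder might be loaded for feature extraction with the Hugging Face Transformers library. The Hub identifier `Orange/SSA-HuBERT-base-60k`, the 16 kHz mono input, and the availability of Transformers-format weights are assumptions that this page does not confirm; `synthetic_waveform` and `extract_features` are hypothetical helper names. The heavy imports sit inside the function so that nothing is downloaded until it is called.

```python
import math

# Assumptions (not confirmed by this page): Hub identifier and input sample rate.
MODEL_ID = "Orange/SSA-HuBERT-base-60k"
SAMPLE_RATE = 16_000


def synthetic_waveform(seconds=1.0, freq_hz=220.0, sample_rate=SAMPLE_RATE):
    """Return a quiet sine tone as a list of floats, as a stand-in for real audio."""
    num_samples = int(seconds * sample_rate)
    return [
        0.1 * math.sin(2.0 * math.pi * freq_hz * t / sample_rate)
        for t in range(num_samples)
    ]


def extract_features(waveform, model_id=MODEL_ID):
    """Run the pretrained encoder and return its last hidden states.

    Requires `torch` and `transformers`; the checkpoint is downloaded on first use.
    """
    import torch
    from transformers import HubertModel

    model = HubertModel.from_pretrained(model_id)
    model.eval()
    # Shape (batch=1, num_samples); raw samples are passed without normalization,
    # which is an assumption about how the checkpoint expects its input.
    input_values = torch.tensor([waveform], dtype=torch.float32)
    with torch.no_grad():
        outputs = model(input_values)
    # Shape (1, num_frames, hidden_size)
    return outputs.last_hidden_state
```

Calling `extract_features(synthetic_waveform())` would return one hidden-state vector per output frame. Real recordings would first need to be converted to mono and resampled to the assumed rate.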

📚 Documentation

Model description

This self-supervised speech model (a.k.a. SSA-HuBERT-base-60k) is based on a HuBERT Base architecture (~95M params) [1]. It was trained on nearly 60,000 hours of speech segments and covers 21 languages and variants spoken in Sub-Saharan Africa.
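To give a feel for what a HuBERT Base encoder outputs, the sketch below computes how many feature frames a waveform of a given length produces. It assumes that this checkpoint keeps the standard HuBERT Base convolutional front end (seven layers with the kernel sizes and strides listed in the code) and a 16 kHz input; neither is stated in the description above, and the helper names are hypothetical.

```python
# (kernel, stride) of each convolutional layer in the standard HuBERT Base
# front end. Assumption: this checkpoint uses the default configuration.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_output_frames(num_samples: int) -> int:
    """Number of encoder frames produced for `num_samples` input samples."""
    length = num_samples
    for kernel, stride in CONV_LAYERS:
        if length < kernel:
            raise ValueError("input is too short for the convolutional front end")
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


def total_stride() -> int:
    """Hop between consecutive frames, in input samples."""
    product = 1
    for _, stride in CONV_LAYERS:
        product *= stride
    return product


frames_per_second = num_output_frames(16_000)  # one second at an assumed 16 kHz
hop_samples = total_stride()
```

Under these assumptions, one second of audio yields 49 frames with a hop of 320 samples (20 ms), which is the time resolution a fine-tuned recognizer works at.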

Pretraining data

Dataset: The training dataset was composed of both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech).
Languages: Bambara (bam), Dyula (dyu), French (fra), Fula (ful), Fulfulde (ffm), Fulfulde (fuh), Gulmancema (gux), Hausa (hau), Kinyarwanda (kin), Kituba (ktu), Lingala (lin), Luba-Lulua (lua), Mossi (mos), Maninkakan (mwk), Sango (sag), Songhai (son), Swahili (swc), Swahili (swh), Tamasheq (taq), Wolof (wol), Zarma (dje).

ASR fine-tuning

The SpeechBrain toolkit (Ravanelli et al., 2021) is used to fine-tune the model. Fine-tuning is done separately for each language using the FLEURS dataset [2]. The pretrained model (SSA-HuBERT-base-60k) serves as the speech encoder and is fully fine-tuned together with two 1024-unit linear layers and a softmax output on top.
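The head described above can be pictured with the following PyTorch sketch. It is not the authors' SpeechBrain recipe: the activation function, the separate output projection, the use of log-softmax, and the vocabulary size are all assumptions made for illustration, and `build_head` and `head_parameter_count` are hypothetical names. The encoder width of 768 is the hidden size of a HuBERT Base model.

```python
ENCODER_DIM = 768   # hidden size of a HuBERT Base encoder
HIDDEN_DIM = 1024   # width of the two linear layers mentioned in the text


def head_parameter_count(vocab_size, encoder_dim=ENCODER_DIM, hidden_dim=HIDDEN_DIM):
    """Count weights and biases of the sketched head (activations have none)."""
    layer_1 = encoder_dim * hidden_dim + hidden_dim
    layer_2 = hidden_dim * hidden_dim + hidden_dim
    output = hidden_dim * vocab_size + vocab_size  # assumed output projection
    return layer_1 + layer_2 + output


def build_head(vocab_size, encoder_dim=ENCODER_DIM, hidden_dim=HIDDEN_DIM):
    """Build the sketched head; requires `torch`, imported only when called."""
    import torch.nn as nn

    return nn.Sequential(
        nn.Linear(encoder_dim, hidden_dim),
        nn.LeakyReLU(),                      # assumed activation
        nn.Linear(hidden_dim, hidden_dim),
        nn.LeakyReLU(),                      # assumed activation
        nn.Linear(hidden_dim, vocab_size),   # assumed projection to output tokens
        nn.LogSoftmax(dim=-1),               # softmax output, in the log domain
    )
```

With a hypothetical 40-symbol character vocabulary the head adds under two million parameters, small next to the roughly 95M-parameter encoder that is fine-tuned with it.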

Results

The following results were obtained with greedy decoding (no language model rescoring). The table below gives character error rates (CERs) and word error rates (WERs), in percent, for the 20 languages of the Sub-Saharan African (SSA) subset of the FLEURS dataset.
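As an illustration of how such numbers are typically produced, the sketch below pairs a greedy decoder with edit-distance-based CER and WER. It assumes a CTC-style output with a blank symbol at index 0, which the text does not state, and it is not the scoring code used by the authors; all function names are hypothetical.

```python
def greedy_ctc_decode(frame_scores, labels, blank=0):
    """Take the best label per frame, merge repeats, then drop blanks.

    Assumes CTC-style outputs; no language model is involved.
    """
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded, previous = [], None
    for index in best_path:
        if index != previous and index != blank:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of tokens."""
    previous_row = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current_row = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current_row.append(
                min(
                    previous_row[j] + 1,                            # deletion
                    current_row[j - 1] + 1,                         # insertion
                    previous_row[j - 1] + (ref_token != hyp_token)  # substitution
                )
            )
        previous_row = current_row
    return previous_row[-1]


def error_rate(reference_tokens, hypothesis_tokens):
    """Edit distance as a percentage of the reference length."""
    if not reference_tokens:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference_tokens, hypothesis_tokens) / len(reference_tokens)


def cer(reference, hypothesis):
    """Character error rate in percent."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())
```

A single wrong character in a three-word sentence costs one word out of three but only one character out of the whole string, which is why the WER columns in the table are so much higher than the CER columns.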

Language	CER (%)	CER (%, joint fine-tuning)	WER (%)	WER (%, joint fine-tuning)
Afrikaans	23.3	20.3	68.4	62.6
Amharic	15.9	14.9	52.7	49.0
Fula	21.2	17.8	61.9	56.4
Ganda	11.5	10.7	52.8	50.3
Hausa	10.5	9.0	32.5	29.4
Igbo	19.7	17.2	57.5	52.9
Kamba	16.1	15.6	53.9	53.7
Lingala	8.7	6.9	24.7	20.9
Luo	9.9	8.2	38.9	34.9
Northern Sotho	13.5	11.7	43.2	38.9
Nyanja	13.3	10.9	54.2	48.3
Oromo	22.8	20.1	78.1	74.8
Shona	11.6	8.3	50.2	39.3
Somali	21.6	19.7	64.9	60.3
Swahili	7.1	5.5	23.8	20.3
Umbundu	21.7	18.8	61.7	54.2
Wolof	19.4	17.0	55.0	50.7
Xhosa	11.9	9.9	51.6	45.9
Yoruba	24.3	23.5	67.5	65.7
Zulu	12.2	9.6	53.4	44.9
Overall average	15.8	13.8	52.3	47.7
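The "Overall average" row can be checked as an unweighted mean over the 20 languages. The sketch below copies the per-language values from the table and recomputes each column; treating the average as a simple per-language mean is an assumption, although it reproduces the reported figures.

```python
# (language, CER, CER joint, WER, WER joint), copied from the table above.
RESULTS = [
    ("Afrikaans", 23.3, 20.3, 68.4, 62.6),
    ("Amharic", 15.9, 14.9, 52.7, 49.0),
    ("Fula", 21.2, 17.8, 61.9, 56.4),
    ("Ganda", 11.5, 10.7, 52.8, 50.3),
    ("Hausa", 10.5, 9.0, 32.5, 29.4),
    ("Igbo", 19.7, 17.2, 57.5, 52.9),
    ("Kamba", 16.1, 15.6, 53.9, 53.7),
    ("Lingala", 8.7, 6.9, 24.7, 20.9),
    ("Luo", 9.9, 8.2, 38.9, 34.9),
    ("Northern Sotho", 13.5, 11.7, 43.2, 38.9),
    ("Nyanja", 13.3, 10.9, 54.2, 48.3),
    ("Oromo", 22.8, 20.1, 78.1, 74.8),
    ("Shona", 11.6, 8.3, 50.2, 39.3),
    ("Somali", 21.6, 19.7, 64.9, 60.3),
    ("Swahili", 7.1, 5.5, 23.8, 20.3),
    ("Umbundu", 21.7, 18.8, 61.7, 54.2),
    ("Wolof", 19.4, 17.0, 55.0, 50.7),
    ("Xhosa", 11.9, 9.9, 51.6, 45.9),
    ("Yoruba", 24.3, 23.5, 67.5, 65.7),
    ("Zulu", 12.2, 9.6, 53.4, 44.9),
]


def column_means(rows):
    """Unweighted mean of each numeric column, rounded to one decimal."""
    count = len(rows)
    columns = zip(*(row[1:] for row in rows))
    return [round(sum(column) / count, 1) for column in columns]


averages = column_means(RESULTS)  # [CER, CER joint, WER, WER joint]
```

Because every language counts equally in this mean, the averages do not reflect differences in the size of each language's test set.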

Reproducibility

We provide a notebook to reproduce the ASR experiments described in our paper: see SB_ASR_FLEURS_finetuning.ipynb. With the ASR_FLEURS-swahili_hf.yaml config file, you can run the recipe on Swahili.

🔧 Technical Details

The model is based on the HuBERT Base architecture. The pretraining data consists of studio recordings and street interviews. For fine-tuning, the SpeechBrain toolkit is used with the FLEURS dataset, and the model is fully fine-tuned with two 1024-unit linear layers and a softmax output on top.

📄 License

This model is released under the CC BY-NC 4.0 license.

Publication

This model was presented at AfricaNLP 2024. The associated paper is available here: Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context

Citation

Please cite our paper when using the SSA-HuBERT-base-60k model:

Caubrière, A., & Gauthier, E. (2024). Africa-Centric Self-Supervised Pre-Training for Multilingual Speech Representation in a Sub-Saharan Context. In 5th Workshop on African Natural Language Processing (AfricaNLP 2024).

Bibtex citation:

@inproceedings{caubri{\`e}re2024ssaspeechssl,    
    title={Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context},
    author={Antoine Caubri{\`e}re and Elodie Gauthier},    
    booktitle={5th Workshop on African Natural Language Processing},    
    year={2024},    
    url={https://openreview.net/forum?id=zLOhcft2E7}}

References

[1] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021. doi: 10.1109/TASLP.2021.3122291.

[2] Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa, Clara Rivera, and Ankur Bapna. FLEURS: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pp. 798–805, 2022. doi: 10.1109/SLT54892.2023.10023141.
