wav2vec2-large-mms-1b-wolof: Open-Source Model for Wolof Automatic Speech Recognition

Wav2vec2 Large Mms 1b Wolof

Developed by bilalfaye

This model is a fine-tuned version of facebook/mms-1b-all on the Isma/alffa_wolof dataset, designed for Wolof automatic speech recognition (ASR).

Task: Speech recognition

Format: Safetensors

License: MIT
Tags: Wolof ASR, low-resource speech recognition, MMS fine-tuning

Downloads: 50

Published: 1/8/2025

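Model Overview

This model is based on the Wav2Vec 2.0 architecture and fine-tuned for speech recognition. The base model, facebook/mms-1b-all, is a general-purpose ASR model trained on a multilingual corpus. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains Wolof audio recordings.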

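Model Features

Multilingual foundation

Built on facebook/mms-1b-all, a base model that supports multilingual speech recognition.

Optimized for Wolof

Fine-tuned specifically on a Wolof dataset to improve recognition accuracy on Wolof speech.

Training setup

Trained with native AMP mixed precision and the Adam optimizer.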

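Model Capabilities

Wolof speech recognition

Multilingual speech recognition (via the facebook/mms-1b-all base model)

Use Cases

Speech-to-text

Wolof recording transcription

Transcribes Wolof audio recordings into text.

Reported word error rate (WER): 0.1842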

🚀 wav2vec2-large-mms-1b-wolof

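This model is a fine-tuned version of facebook/mms-1b-all on the Isma/alffa_wolof dataset, intended for automatic speech recognition (ASR) in Wolof.

🚀 Quick Start

The model is based on the Wav2Vec 2.0 architecture and fine-tuned for speech recognition. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose automatic speech recognition. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains Wolof audio recordings.

✨ Key Features

Based on the Wav2Vec 2.0 architecture and fine-tuned for Wolof automatic speech recognition.
Transcribes Wolof audio into text.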

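📦 Installation

# The examples below import datasets, transformers, and torch.
# In a notebook cell, prefix the command with "!".
pip install datasets transformers torch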

💻 Usage Examples

Basic Usage

Manual inference example:

# Load test dataset
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
dataset

# Play one sample (index 322) in a notebook using IPython
from IPython.display import Audio, display

display(Audio(dataset['train'][322]['audio']['array'], rate=16000))

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use half precision on GPU only; fall back to float32 on CPU
torch_dtype = torch.float16 if device.type == "cuda" else torch.float32

# Load the model with the Wolof target language, then move it to the device
model = Wav2Vec2ForCTC.from_pretrained(model_id,
                                       target_lang="wol",
                                       torch_dtype=torch_dtype
                                       ).to(device)


processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")


# Process the audio (the model expects 16 kHz input)
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

# Move inputs to the same device and dtype as the model
input_values = input_dict.input_values.to(device, dtype=torch_dtype)

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions (greedy CTC decoding: most likely token per frame)
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
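
The decoding step above takes the arg-max token per frame and lets the processor turn those IDs into text. As a rough illustration of what greedy CTC decoding does, the sketch below merges repeated frame labels, drops the blank token, and maps the word delimiter to a space. The vocabulary, the blank ID of 0, and the function name are hypothetical and chosen only for this example; the real model ships its own vocabulary, and the processor may apply extra cleanup.

```python
from itertools import groupby
from typing import Dict, List

# Hypothetical toy vocabulary, NOT the model's real one.
# In Wav2Vec2-style CTC tokenizers the padding token doubles as the CTC blank
# and "|" marks word boundaries.
TOY_VOCAB: Dict[int, str] = {0: "<pad>", 1: "|", 2: "n", 3: "a", 4: "g"}


def ctc_greedy_collapse(
    frame_ids: List[int],
    id_to_token: Dict[int, str],
    blank_id: int = 0,
    word_delimiter: str = "|",
) -> str:
    """Collapse per-frame arg-max IDs into text (greedy CTC decoding)."""
    # 1) Merge runs of identical consecutive IDs into a single ID.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) Remove blanks; a blank between two equal IDs keeps both letters.
    kept = [token_id for token_id in merged if token_id != blank_id]
    # 3) Map IDs to characters and turn the word delimiter into a space.
    chars = [id_to_token[token_id] for token_id in kept]
    text = "".join(" " if c == word_delimiter else c for c in chars)
    # Normalize whitespace (leading, trailing, and repeated spaces).
    return " ".join(text.split())


# Frames: n n <pad> a a | n g g <pad> a
print(ctc_greedy_collapse([2, 2, 0, 3, 3, 1, 2, 4, 4, 0, 3], TOY_VOCAB))  # na nga
```

Greedy decoding is the simplest option; a beam search with a language model is a common alternative, but nothing in this model card indicates that one was used.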

Advanced Usage

Inference example using a pipeline:

from transformers import pipeline
import torch

# Model ID
model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Determine device (use GPU if available, otherwise fallback to CPU)
device = 0 if torch.cuda.is_available() else -1

# Use half precision (float16) for inference if GPU is available
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Set up the pipeline for automatic speech recognition
# (the tokenizer and feature extractor are loaded from the model ID)
pipe = pipeline(
    task="automatic-speech-recognition",
    model=model_id,
    device=device,  # GPU index 0 if available, otherwise CPU (-1)
    torch_dtype=torch_dtype,  # float16 on GPU, float32 on CPU
)

# Input audio: reuses the `dataset` loaded in the basic usage example
# (the sample is assumed to be at 16 kHz)
audio_array = dataset['train'][322]["audio"]["array"]

# Run inference
result = pipe(audio_array)

# Prediction
print("Prediction:")
print(result['text'])

# Reference (for comparison)
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())

Example code for freeing memory

import gc
import torch
import psutil

# Collect unreachable Python objects first so their tensors can be released
gc.collect()

# Then release cached, unused GPU memory - only needed if you use a GPU
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Releases unused cached GPU memory
    torch.cuda.reset_peak_memory_stats()  # Resets peak memory statistics

# Optionally, check memory status after clearing
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Reserved: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")

📚 Documentation

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 20
mixed_precision_training: Native AMP
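
To make the scheduler settings concrete, the sketch below computes the learning rate at a given optimizer step under a linear schedule with warmup. It assumes the common rule of a linear ramp from 0 to the peak rate over the warmup steps, followed by a linear decay to 0. The total of 17,500 steps is also an assumption: it combines the 875 steps per epoch implied by the results table (12,250 steps at epoch 14) with the 20 configured epochs. The function name is illustrative only.

```python
def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-4,      # learning_rate from the list above
    warmup_steps: int = 500,    # lr_scheduler_warmup_steps from the list above
    total_steps: int = 17_500,  # assumption: 875 steps/epoch * 20 epochs
) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500, 9_000, 17_500):
    print(f"step {step:>6}: lr = {linear_schedule_lr(step):.2e}")
```

Under these assumptions the rate peaks at step 500 and has fallen to half of the peak by step 9,000.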

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.3793	14.0	12250	0.1517	0.1888
0.3709	15.0	13125	0.1512	0.1882
0.3702	16.0	14000	0.1499	0.1858
0.367	17.0	14875	0.1492	0.1848
0.3656	18.0	15750	0.1493	0.1842
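
For readers unfamiliar with the WER column, the sketch below shows one way word error rate can be computed for a single reference and hypothesis pair: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a minimal illustration, not the evaluation code used for this model; corpus-level tools usually sum errors over all utterances before dividing. The sample strings are placeholders.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution ("b" -> "x") and one deletion ("d") over 4 reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Read this way, the final row's WER of 0.1842 corresponds to roughly 18 word errors per 100 reference words.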

Framework versions

Transformers 4.41.2
PyTorch 2.4.0+cu121
Datasets 3.2.0
Tokenizers 0.19.1

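🔧 Technical Details

The model is based on the Wav2Vec 2.0 architecture, which is widely used for speech recognition. The base model, facebook/mms-1b-all, was pre-trained on a multilingual corpus and provides a strong foundation for general-purpose automatic speech recognition. Fine-tuning used the Wolof Isma/alffa_wolof dataset to adapt the model parameters to the characteristics of Wolof speech.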
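
As a small illustration of the architecture, the sketch below estimates how many output frames (rows of logits) the convolutional feature encoder produces for a given number of 16 kHz samples. The kernel and stride values are the standard Wav2Vec 2.0 configuration and are an assumption here; for this checkpoint they can be checked against `model.config.conv_kernel` and `model.config.conv_stride`. The function name is illustrative.

```python
from typing import Sequence

# Assumed standard Wav2Vec 2.0 feature-encoder settings (verify against
# model.config.conv_kernel and model.config.conv_stride).
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_output_frames(
    num_samples: int,
    kernels: Sequence[int] = CONV_KERNELS,
    strides: Sequence[int] = CONV_STRIDES,
) -> int:
    """Number of encoder frames for `num_samples` raw audio samples."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


print(num_output_frames(16_000))   # 1 second of 16 kHz audio -> 49
print(num_output_frames(160_000))  # 10 seconds of 16 kHz audio -> 499
```

With these settings the strides multiply to 320 samples, so each frame advances by about 20 ms of audio, and the CTC head emits one token distribution per frame.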