wav2vec2-large-mms-1b-wolof open-source model - free automatic speech recognition for Wolof

Wav2vec2 Large Mms 1b Wolof

Developed by bilalfaye

This model is a version of facebook/mms-1b-all fine-tuned on the Isma/alffa_wolof dataset for automatic speech recognition (ASR) in Wolof.

Speech recognition

Safetensors

License: MIT #WolofASR #LowResourceSpeechRecognition #MMSFineTuning

Downloads: 50

Released: 1/8/2025

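Model overview

The model uses the Wav2Vec 2.0 architecture and is fine-tuned for speech recognition. The base model, facebook/mms-1b-all, is a general-purpose ASR model trained on a multilingual corpus. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains audio recordings in Wolof.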

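Model features

Multilingual base

Built on facebook/mms-1b-all, which supports multilingual speech recognition.

Optimized for Wolof

Fine-tuned on a Wolof dataset to improve recognition accuracy on Wolof speech.

Efficient training

Trained with native mixed precision (AMP) and the Adam optimizer.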

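Model capabilities

Wolof speech recognition

Multilingual speech recognition

Use cases

Speech to text

Wolof recording transcription

Transcribes Wolof audio recordings into text.

Word error rate (WER): 0.1842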
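
To make the reported metric concrete, the following is a minimal sketch of how word error rate is commonly computed: the word-level edit distance between the reference and the prediction, divided by the number of reference words. The function name and the placeholder sentences are illustrative assumptions, not the evaluation code used for this model.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens: one substitution (b -> x) and one deletion (e)
print(word_error_rate("a b c d e", "a x c d"))  # 2 errors / 5 words = 0.4
```

Under this definition, a WER of 0.1842 corresponds to roughly 18 word-level errors per 100 reference words.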

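🚀 wav2vec2-large-mms-1b-wolof

This model is a version of facebook/mms-1b-all fine-tuned on the Isma/alffa_wolof dataset for automatic speech recognition (ASR) in Wolof.

🚀 Quick start

The model uses the Wav2Vec 2.0 architecture and is fine-tuned for speech recognition. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose automatic speech recognition. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains Wolof audio recordings.

✨ Key features

Based on the Wav2Vec 2.0 architecture and fine-tuned for Wolof automatic speech recognition.
Transcribes Wolof audio into text.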

📦 Installation

pip install datasets transformers torch psutil

💻 Usage examples

Basic usage

Manual inference example:

# Load the test dataset
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
print(dataset)

# Play one sample (index 322) in a notebook
from IPython.display import Audio, display

display(Audio(dataset['train'][322]['audio']['array'], rate=16000))

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Use half precision on GPU only; float16 inference on CPU can be slow or unsupported
torch_dtype = torch.float16 if device.type == "cuda" else torch.float32

# Load the model with the Wolof adapter, then move it to the selected device
model = Wav2Vec2ForCTC.from_pretrained(
    model_id,
    target_lang="wol",
    torch_dtype=torch_dtype,
).to(device)

processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")

# Convert the raw waveform into model inputs
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

# Move the inputs to the same device and dtype as the model
input_values = input_dict.input_values.to(device, dtype=torch_dtype)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
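
The decoding step above takes the most likely token per frame and lets the processor turn those IDs into text. As a rough illustration of what greedy CTC decoding does, the sketch below merges consecutive repeats, drops the blank token, and maps the word delimiter to a space. The toy vocabulary, the blank ID of 0, and the helper name are assumptions made for illustration; the real tokenizer's vocabulary and special tokens differ.

```python
from itertools import groupby


def ctc_greedy_decode(pred_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Simplified greedy CTC decoding for a single sequence of frame-level IDs."""
    # 1) Merge runs of identical consecutive IDs into one
    collapsed = [token_id for token_id, _ in groupby(pred_ids)]
    # 2) Drop the blank token, which only separates repeated characters
    chars = [id_to_token[i] for i in collapsed if i != blank_id]
    # 3) Join characters and turn the word delimiter into spaces
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary: 0 is the blank, "|" separates words
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "b"}
frames = [2, 2, 0, 3, 3, 0, 0, 1, 2, 0, 2, 3]

print(ctc_greedy_decode(frames, toy_vocab))  # ab aab
```

The blank between the two `a` frames in the second word is what keeps them from being merged into a single character.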

Advanced usage

Inference example using a pipeline:

from transformers import pipeline
import torch

# Model ID
model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Determine device (use GPU if available, otherwise fallback to CPU)
device = 0 if torch.cuda.is_available() else -1

# Use half precision (float16) for inference if GPU is available
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Set up the pipeline for automatic speech recognition
pipe = pipeline(
    task="automatic-speech-recognition", 
    model=model_id, 
    processor=model_id, 
    device=device,  # Specify the device (GPU if available, otherwise CPU)
    torch_dtype=torch_dtype,  # Set the precision (float16 for half precision, float32 otherwise)
    framework="pt"  # Use PyTorch as the framework
)

# Input audio: reuses the `dataset` object loaded in the basic usage example
audio_array = dataset['train'][322]["audio"]["array"]  # Fetch one audio sample

# Run inference
result = pipe(audio_array)

# Prediction
print("Prediction:")
print(result['text'])

# Reference (for comparison)
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())

Example code for freeing memory

import gc
import torch
import psutil

# Collect unreachable Python objects first so the tensors they hold can be released
gc.collect()

# Release cached, unused GPU memory (only relevant if you use a GPU)
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Releases cached blocks that are no longer in use
    torch.cuda.reset_peak_memory_stats()  # Resets peak-memory statistics

# Optionally, check memory status after clearing
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Reserved: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")

📚 Documentation

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 20
mixed_precision_training: Native AMP
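
To show how the learning rate, linear scheduler, and warmup settings interact, here is a minimal sketch of a linear schedule with warmup: the rate rises linearly from 0 to 0.0001 over the first 500 steps, then decays linearly to 0. The total of 17,500 steps is an assumption (20 epochs at 875 steps per epoch, as implied by the step counts in the training results), and the function name is hypothetical.

```python
def linear_schedule_lr(step, base_lr=1e-4, warmup_steps=500, total_steps=17_500):
    """Learning rate at a given optimizer step for linear warmup and linear decay.

    total_steps=17_500 is an assumed value (20 epochs x 875 steps per epoch).
    """
    if step < warmup_steps:
        # Warmup: ramp up linearly from 0 to base_lr
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr to 0 at total_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 9_000, 17_500):
    print(step, linear_schedule_lr(step))
```

With these settings the peak rate of 0.0001 is reached at step 500, which is a little over half of the first epoch.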

Training results

Training loss	Epoch	Step	Validation loss	WER
0.3793	14.0	12250	0.1517	0.1888
0.3709	15.0	13125	0.1512	0.1882
0.3702	16.0	14000	0.1499	0.1858
0.367	17.0	14875	0.1492	0.1848
0.3656	18.0	15750	0.1493	0.1842
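
As a small worked example of reading the table, the sketch below copies the rows above into Python and looks up the checkpoint with the lowest WER, the one with the lowest validation loss, and the number of steps per epoch. The variable names are illustrative only.

```python
# Rows copied from the table: (training loss, epoch, step, validation loss, WER)
rows = [
    (0.3793, 14.0, 12250, 0.1517, 0.1888),
    (0.3709, 15.0, 13125, 0.1512, 0.1882),
    (0.3702, 16.0, 14000, 0.1499, 0.1858),
    (0.3670, 17.0, 14875, 0.1492, 0.1848),
    (0.3656, 18.0, 15750, 0.1493, 0.1842),
]

best_by_wer = min(rows, key=lambda row: row[4])
best_by_val_loss = min(rows, key=lambda row: row[3])
steps_per_epoch = {round(step / epoch) for _, epoch, step, _, _ in rows}

print("Lowest WER:", best_by_wer[4], "at epoch", best_by_wer[1])
print("Lowest validation loss:", best_by_val_loss[3], "at epoch", best_by_val_loss[1])
print("Steps per epoch:", steps_per_epoch)
```

In these rows the lowest validation loss (epoch 17) and the lowest WER (epoch 18) fall on different checkpoints, so the choice of checkpoint depends on which metric is prioritized.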

Framework versions

Transformers 4.41.2
PyTorch 2.4.0+cu121
Datasets 3.2.0
Tokenizers 0.19.1

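🔧 Technical details

The model uses the Wav2Vec 2.0 architecture, which is widely used for speech recognition. The base model, facebook/mms-1b-all, was pretrained on a multilingual corpus and provides the foundation for general-purpose automatic speech recognition. Fine-tuning used the Wolof dataset Isma/alffa_wolof to adapt the model parameters to the characteristics of Wolof speech.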