wav2vec2-large-mms-1b-wolof open-source model: free automatic speech recognition for Wolof

Wav2vec2 Large Mms 1b Wolof

Developed by bilalfaye

This model is a version of facebook/mms-1b-all fine-tuned on the Isma/alffa_wolof dataset. It is designed specifically for automatic speech recognition (ASR) in Wolof.

Speech recognition

Safetensors

Other open-source license: MIT #Wolof ASR #Low-resource speech recognition #MMS fine-tuning

Downloads: 50

Release date: 1/8/2025

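Model overview

The model is based on the Wav2Vec 2.0 architecture and fine-tuned for speech recognition. Its base model, facebook/mms-1b-all, is a general-purpose ASR model trained on a multilingual corpus. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains Wolof speech recordings.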

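Model features

Multilingual support

The base model, facebook/mms-1b-all, supports multilingual speech recognition.

Wolof optimization

The model is fine-tuned on a Wolof dataset to improve recognition accuracy on the phonetic characteristics of Wolof speech.

Efficient training

Training used native mixed precision (AMP) with the Adam optimizer.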

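Model capabilities

Wolof speech recognition

Multilingual speech recognition (via the base model; this fine-tuned version targets Wolof)

Use cases

Speech to text

Transcribing Wolof recordings

Converts Wolof speech recordings into text.

Word error rate (WER): 0.1842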

🚀 wav2vec2-large-mms-1b-wolof

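This model is a version of facebook/mms-1b-all fine-tuned on the Isma/alffa_wolof dataset. It is designed to perform automatic speech recognition (ASR) for Wolof.

📚 Documentation

Model description

The model is based on the Wav2Vec 2.0 architecture, fine-tuned for speech recognition. The base model, facebook/mms-1b-all, was trained on a multilingual corpus for general-purpose ASR. This fine-tuned version was trained specifically on the Waxal Wolof dataset, which contains Wolof speech recordings.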

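Training and evaluation data

The model was trained on the Isma/alffa_wolof dataset, which contains Wolof speech samples. The dataset was used to fine-tune the model for better accuracy on the phonetic characteristics of Wolof speech.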

Manual inference

# Install the dependency (notebook syntax; in a shell, drop the leading "!")
! pip install datasets

# Load test dataset
from datasets import load_dataset

dataset = load_dataset("perrynelson/waxal-wolof", trust_remote_code=True)
dataset

# Play one sample (index 322) with IPython; the audio is assumed to be sampled at 16 kHz
from IPython.display import Audio

Audio(dataset['train'][322]['audio']['array'], rate=16000)

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch

model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Use the GPU when available. Half precision is used only on GPU, because
# float16 inference on CPU can be slow or unsupported depending on the PyTorch version.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch_dtype = torch.float16 if device.type == "cuda" else torch.float32

# Load the fine-tuned model with the Wolof target language, then move it to the device
model = Wav2Vec2ForCTC.from_pretrained(model_id, 
                                       target_lang="wol", 
                                       torch_dtype=torch_dtype
                                       ).to(device)


processor = Wav2Vec2Processor.from_pretrained(model_id)
processor.tokenizer.set_target_lang("wol")


# Process the audio
input_dict = processor(
    dataset['train'][322]["audio"]["array"],
    sampling_rate=16_000,
    return_tensors="pt",
    padding=True
)

# Move inputs to the same device and dtype as the model
input_values = input_dict.input_values.to(device, dtype=model.dtype)

# Perform inference without tracking gradients
with torch.no_grad():
    logits = model(input_values).logits

# Decode predictions: take the most likely token id for each frame
pred_ids = torch.argmax(logits, dim=-1)[0]

print("Prediction:")
print(processor.decode(pred_ids))

print("\nReference:")
print(dataset['train'][322]['transcription'].lower())
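
The decoding step above takes the argmax token id for every audio frame and then relies on processor.decode to turn that frame-level sequence into text. The sketch below illustrates, in plain Python, what that CTC-style greedy collapse does conceptually: merge consecutive repeats, drop the blank token, and turn the word delimiter into spaces. It is a simplified illustration rather than the library's implementation; the toy vocabulary, the token ids, and the function name ctc_greedy_collapse are hypothetical and do not correspond to the model's real vocabulary.

```python
from itertools import groupby


def ctc_greedy_collapse(pred_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse a frame-level argmax sequence into text, CTC-style."""
    # 1) Merge runs of identical consecutive ids (one token can span many frames).
    merged = [token_id for token_id, _ in groupby(pred_ids)]
    # 2) Drop the blank token; it only separates genuine repeated characters.
    tokens = [id_to_token[i] for i in merged if i != blank_id]
    # 3) Join characters and map the word delimiter to spaces.
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (assumption: id 0 plays the role of the CTC blank).
toy_vocab = {0: "<pad>", 1: "|", 2: "n", 3: "a", 4: "g", 5: "e", 6: "d", 7: "f"}

# Hypothetical per-frame argmax ids.
frames = [2, 2, 3, 0, 1, 1, 2, 4, 4, 3, 0, 1, 6, 5, 5, 7, 7]
print(ctc_greedy_collapse(frames, toy_vocab))
```

Note that a blank between two identical ids keeps both characters, while two identical ids with no blank between them collapse into one.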

Inference using the pipeline

from transformers import pipeline
import torch

# Model ID
model_id = "bilalfaye/wav2vec2-large-mms-1b-wolof"

# Determine device (use GPU if available, otherwise fallback to CPU)
device = 0 if torch.cuda.is_available() else -1

# Use half precision (float16) for inference if GPU is available
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Set up the pipeline for automatic speech recognition
pipe = pipeline(
    task="automatic-speech-recognition", 
    model=model_id, 
    processor=model_id, 
    device=device,  # Specify the device (GPU if available, otherwise CPU)
    torch_dtype=torch_dtype,  # Set the precision (float16 for half precision, float32 otherwise)
    framework="pt"  # Use PyTorch as the framework
)

# Fetch an audio sample; `dataset` is the one loaded in the manual inference section
audio_array = dataset['train'][322]["audio"]["array"]

# Run inference (a raw array is assumed to already be sampled at 16 kHz)
result = pipe(audio_array)

# Prediction
print("Prediction:")
print(result['text'])

# Reference (for comparison)
print("\nReference:")
print(dataset['train'][322]['transcription'].lower())

Freeing memory

import gc
import torch
import psutil

# Free up unused memory in CUDA (GPU) - only needed if you use a GPU
if torch.cuda.is_available():
    torch.cuda.empty_cache()  # Clears GPU memory cache
    torch.cuda.reset_peak_memory_stats()  # Resets memory stats

# Collect any unused memory in Python (CPU)
gc.collect()  # Collect unused memory in Python's garbage collector

# Optionally, check memory status after clearing
if torch.cuda.is_available():
    print(f"GPU Memory Allocated: {torch.cuda.memory_allocated()} bytes")
    print(f"GPU Memory Cached: {torch.cuda.memory_reserved()} bytes")
else:
    print(f"CPU Memory Usage: {psutil.virtual_memory().percent}%")

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 16
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
num_epochs: 20
mixed_precision_training: Native AMP
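
To make the interaction of learning_rate, lr_scheduler_type, and lr_scheduler_warmup_steps concrete, here is a minimal plain-Python sketch of a linear schedule with warmup. It is an illustrative re-implementation, not the training code. The value total_steps=17_500 is an assumption: the results table implies 875 optimizer steps per epoch (12250 steps at epoch 14), multiplied by num_epochs=20. The function name linear_lr_with_warmup is hypothetical.

```python
def linear_lr_with_warmup(step, base_lr=1e-4, warmup_steps=500, total_steps=17_500):
    """Learning rate at a given optimizer step for a linear warmup/decay schedule."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: fall linearly from base_lr at the end of warmup to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Inspect the schedule at a few points (assumed total of 17,500 steps).
for step in (0, 250, 500, 8750, 15750, 17500):
    print(step, linear_lr_with_warmup(step))
```

Under these assumptions the learning rate peaks at 0.0001 after 500 steps and has already decayed to roughly a tenth of that by step 15750, the last row of the results table.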

Training results

Training Loss	Epoch	Step	Validation Loss	WER
0.3793	14.0	12250	0.1517	0.1888
0.3709	15.0	13125	0.1512	0.1882
0.3702	16.0	14000	0.1499	0.1858
0.367	17.0	14875	0.1492	0.1848
0.3656	18.0	15750	0.1493	0.1842
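
The WER column is the word error rate: the number of word substitutions, deletions, and insertions needed to turn the prediction into the reference, divided by the number of reference words. A WER of 0.1842 therefore corresponds to roughly 18 word errors per 100 reference words. The sketch below shows one way to compute it for a single utterance with a word-level edit distance. It is a minimal illustration, not the evaluation script used for this model; the function name word_error_rate and the example sentences are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution and one deletion over four reference words.
print(word_error_rate("the quick brown fox", "the quack brown"))
```

Corpus-level WER is usually computed by summing the errors over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance scores.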

Framework versions

Transformers 4.41.2
Pytorch 2.4.0+cu121
Datasets 3.2.0
Tokenizers 0.19.1

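Intended uses and limitations

Intended uses: The model is intended for Wolof speech-to-text tasks. It can be used to convert Wolof speech recordings into written text.
Limitations: The model performs best on clean audio and may struggle with noisy or low-quality recordings. It is designed specifically for Wolof and may not perform well on other languages.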