wav2vec2-large-xlsr-53-arabic: an open-source speech recognition model for Arabic

Wav2vec2 Large Xlsr 53 Arabic

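Developed by jonatasgrosman

An Arabic speech recognition model fine-tuned from facebook/wav2vec2-large-xlsr-53, trained on Common Voice and the Arabic Speech Corpus

Speech recognition, Arabic, License: Apache-2.0 #Arabic-speech-recognition #XLSR-53-fine-tuning #low-word-error-rate

Downloads: 2.3M

Released: 3/2/2022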

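Model overview

An automatic speech recognition (ASR) model optimized for Arabic that converts speech input sampled at 16 kHz into text

Model features

High-performance Arabic recognition

Reaches 39.59% WER and 18.18% CER on the Common Voice Arabic test set, the lowest error rates among the Arabic ASR models compared in the evaluation table

Multi-dataset training

Trained on Common Voice 6.1 together with the Arabic Speech Corpus to improve generalization

Ready-to-use model

Works directly without an additional language model, which simplifies deployment

Model capabilities

Arabic speech recognition

16 kHz audio processing

Long-form speech transcription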

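Use cases

Speech to text

Voice memo transcription

Converts Arabic voice memos into searchable text

Character-level accuracy of roughly 82% (inferred from the 18.18% CER)

Customer service call records

Automatically transcribes Arabic customer service calls

Assistive technology

Hearing impairment support

Provides real-time captions for people with hearing impairments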

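🚀 Fine-tuned XLSR-53 large model for speech recognition in Arabic

This project fine-tunes the facebook/wav2vec2-large-xlsr-53 model for Arabic speech recognition, using the train and validation splits of Common Voice 6.1 and the Arabic Speech Corpus. When using this model, make sure that your speech input is sampled at 16 kHz.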
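Because the model expects 16 kHz input, it can help to check a file's sampling rate before transcribing it. The following is a minimal sketch using only Python's standard `wave` module; the helper names `wav_sample_rate` and `needs_resampling` are hypothetical, and the check only covers uncompressed WAV files.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, per the text above


def wav_sample_rate(path: str) -> int:
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_RATE) -> bool:
    """True when the file's rate differs from the rate the model expects."""
    return wav_sample_rate(path) != target_rate
```

When a file does need resampling, loading it with `librosa.load(path, sr=16_000)`, as the scripts in this document do, resamples it on load.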

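This model was fine-tuned thanks to the GPU credits generously provided by OVHcloud 😊. The training script is available at: https://github.com/jonatasgrosman/wav2vec2-sprint

🚀 Quick start

The model can be used directly (without a language model); usage examples follow.

✨ Key features

Datasets: trained on Common Voice and the Arabic Speech Corpus.
Evaluation metrics: word error rate (WER) and character error rate (CER).
License: Apache-2.0.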

💻 Usage examples

Basic usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-arabic")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

The references and the model's predictions compare as follows:

Reference	Prediction
ألديك قلم ؟	ألديك قلم
ليست هناك مسافة على هذه الأرض أبعد من يوم أمس.	ليست نالك مسافة على هذه الأرض أبعد من يوم الأمس م
إنك تكبر المشكلة.	إنك تكبر المشكلة
يرغب أن يلتقي بك.	يرغب أن يلتقي بك
إنهم لا يعرفون لماذا حتى.	إنهم لا يعرفون لماذا حتى
سيسعدني مساعدتك أي وقت تحب.	سيسئدنيمساعدتك أي وقد تحب
أَحَبُّ نظريّة علمية إليّ هي أن حلقات زحل مكونة بالكامل من الأمتعة المفقودة.	أحب نظرية علمية إلي هي أن حل قتزح المكوينا بالكامل من الأمت عن المفقودة
سأشتري له قلماً.	سأشتري له قلما
أين المشكلة ؟	أين المشكل
وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ	ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون
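
The inference script above picks the most likely token for every audio frame with `torch.argmax` and passes the resulting ids to `processor.batch_decode`. For a CTC model, that decoding step essentially collapses repeated ids and then drops the blank token. The sketch below illustrates the idea on a toy vocabulary; the ids, tokens, and the function name `greedy_ctc_decode` are hypothetical, and the real tokenizer additionally handles details such as the word delimiter token.

```python
from itertools import groupby
from typing import Dict, List


def greedy_ctc_decode(frame_ids: List[int], id_to_token: Dict[int, str], blank_id: int = 0) -> str:
    """Collapse repeated frame predictions, then drop CTC blank tokens."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    return "".join(id_to_token[i] for i in collapsed if i != blank_id)


# Toy vocabulary (hypothetical, not the model's real one); id 0 is the blank.
vocab = {0: "<pad>", 1: "a", 2: "b", 3: " "}
print(greedy_ctc_decode([1, 1, 0, 1, 2, 2, 0, 3, 3, 2], vocab))  # prints "aab b"
```

The blank between the first two runs of `1` is what lets the decoder emit a doubled letter instead of merging both runs into one.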

📚 Documentation

Evaluation

The model can be evaluated on the Arabic test data of Common Voice as follows:

import torch
import re
import warnings  # used below to silence audio-loading warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "ar"
MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                  "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                  "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                  "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                  "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We predict token ids batch by batch and decode them into strings
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
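
The evaluation script above strips punctuation from each reference sentence before scoring, because the model's output carries no punctuation. The sketch below isolates that step with a small subset of the `CHARS_TO_IGNORE` list; the helper name `strip_ignored` is hypothetical and the subset is chosen only for illustration.

```python
import re

# Small subset of the script's CHARS_TO_IGNORE list, for illustration only.
CHARS_TO_IGNORE = [",", "?", ".", "!", "؟", "،", "؛", "«", "»"]

# Build one character class that matches any of the ignored characters.
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"


def strip_ignored(sentence: str) -> str:
    """Remove every ignored character from a reference sentence."""
    return re.sub(chars_to_ignore_regex, "", sentence)


# The trailing period of the reference is removed before scoring.
print(strip_ignored("إنك تكبر المشكلة."))
```

Note that this step removes characters without trimming whitespace, so a reference such as "ألديك قلم ؟" keeps the space that preceded the question mark.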

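Test results: The table below reports the word error rate (WER) and character error rate (CER) of the model. I also ran the evaluation script above on other models on 2021-05-14. Note that the table may show results that differ from those already reported; this may be caused by particularities of the other evaluation scripts used.

Model	WER	CER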
jonatasgrosman/wav2vec2-large-xlsr-53-arabic	39.59%	18.18%
bakrianoo/sinai-voice-ar-stt	45.30%	21.84%
othrif/wav2vec2-large-xlsr-arabic	45.93%	20.51%
kmfoda/wav2vec2-large-xlsr-arabic	54.14%	26.07%
mohammed/wav2vec2-large-xlsr-arabic	56.11%	26.79%
anas/wav2vec2-large-xlsr-arabic	62.02%	27.09%
elgeish/wav2vec2-large-xlsr-53-arabic	100.00%	100.56%
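
WER and CER are both edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the prediction into the reference, divided by the reference length in words or characters. The sketch below is a plain-Python illustration of that definition, assuming corpus-level aggregation (total edits over total reference length); it is not the linked `wer.py` or `cer.py`, whose details may differ, and the names `edit_distance` and `error_rate` are hypothetical.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences of words or characters."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(references: List[str], predictions: List[str], by_words: bool) -> float:
    """Corpus-level rate: total edits divided by total reference length."""
    split = (lambda s: s.split()) if by_words else list
    edits = sum(edit_distance(split(r), split(p)) for r, p in zip(references, predictions))
    total = sum(len(split(r)) for r in references)
    return edits / total


refs = ["the cat sat"]
hyps = ["the cat sit"]
print(f"WER: {error_rate(refs, hyps, by_words=True) * 100:.2f}")   # WER: 33.33
print(f"CER: {error_rate(refs, hyps, by_words=False) * 100:.2f}")  # CER: 9.09
```

Because insertions count as errors, a rate can exceed 100% when a prediction is much longer than its reference, which is how a CER above 100% can appear in the table.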

📄 License

This project is released under the Apache-2.0 license.

📚 Citation

If you want to cite this model, you can use the following:

@misc{grosman2021xlsr53-large-arabic,
  title={Fine-tuned {XLSR}-53 large model for speech recognition in {A}rabic},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-arabic}},
  year={2021}
}