wav2vec2-large-voxrex-swedish: an open-source Swedish speech recognition model

Wav2vec2 Large Voxrex Swedish

Developed by KBLab

A Swedish automatic speech recognition model fine-tuned from the VoxRex large model. It expects speech input sampled at 16 kHz.

Speech recognition

Transformers

Other: #Swedish speech recognition #Low word error rate (WER) #Broadcast speech adaptation

Downloads: 101.28k

Released: 3/2/2022

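Model overview

This model is an automatic speech recognition (ASR) system optimized for Swedish. It is based on Facebook's Wav2vec 2.0 architecture and was fine-tuned on Swedish radio broadcast, NST, and Common Voice data.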

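Model features

High-accuracy Swedish recognition

Reaches 2.5% WER on the NST + Common Voice test set and 8.49% WER on the Common Voice test set

Language model support

A 4-gram language model lowers the Common Voice WER from 8.49% to 7.37%

Multi-dataset training

Trained on a combination of Swedish radio broadcast, NST, and Common Voice data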

Model capabilities

Swedish speech recognition

16 kHz audio processing

Usable directly, without a language model

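Use cases

Speech-to-text

Broadcast transcription

Automatically transcribes Swedish broadcast content to text

Swedish radio broadcasts are part of the fine-tuning data, so the model is suited to this domain

Voice assistants

Provides speech recognition for Swedish voice assistants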

🚀 Wav2vec 2.0 large VoxRex Swedish (C)

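This is a fine-tuned version of KB's VoxRex large model, trained on Swedish radio broadcast, NST, and Common Voice data. Evaluated without a language model, it reaches a word error rate (WER) of 2.5% on the NST + Common Voice test set (2% of total sentences). On the Common Voice test set, WER is 8.49% without a language model and 7.37% with a 4-gram language model.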

When using this model, make sure your speech input is sampled at 16 kHz.
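One way to guard against a mismatched sampling rate is to check a file's header before transcription. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed WAV input. The helper names `wav_sample_rate`, `is_16khz`, and `write_test_tone` are illustrative and are not part of the model's API.

```python
import math
import os
import struct
import tempfile
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_16khz(path):
    """True if the WAV file at `path` is sampled at 16 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE


def write_test_tone(path, rate, seconds=0.1, freq=440.0):
    """Write a short mono 16-bit sine tone, used here only as demo input."""
    n_frames = int(rate * seconds)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(rate)
        wav_file.writeframes(b"".join(struct.pack("<h", s) for s in samples))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for rate in (16_000, 48_000):
            path = os.path.join(tmp, f"tone_{rate}.wav")
            write_test_tone(path, rate)
            status = "ok" if is_16khz(path) else "needs resampling to 16 kHz"
            print(f"{rate} Hz file: {status}")
```

Files that fail the check need resampling before they are passed to the processor. The usage example further down does this with `torchaudio.transforms.Resample(48_000, 16_000)`.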

Update, 10 January 2022: updated to the VoxRex-C version.

Update, 16 May 2022: the accompanying paper is available at https://arxiv.org/abs/2205.03026.

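✨ Key features

Fine-tuning: the model is fine-tuned on Swedish speech data to improve accuracy on Swedish speech recognition.
Multiple datasets: trained on Common Voice, NST_Swedish_ASR_Database, and P4.
Clear evaluation metric: word error rate (WER) is used to measure model performance.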

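📦 Installation

The original model card gives no installation steps. The usage example below imports torch, torchaudio, datasets, and transformers, so those packages need to be available.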

💻 Usage example

Basic usage

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Swedish Common Voice test split.
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxrex-swedish")

# Resample from 48 kHz (assumed here for the Common Voice clips)
# to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Pad the first two clips to a common length and build the attention mask.
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Pick the most likely token per frame, then decode the ids to text.
predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
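
The `torch.argmax` and `processor.batch_decode` steps above amount to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, then drop the blank token. The pure-Python sketch below illustrates the idea on a toy vocabulary. The vocabulary, the blank index, and the `|` word delimiter are assumptions chosen for illustration and are not read from the model's tokenizer.

```python
from itertools import groupby

# Toy vocabulary for illustration only (hypothetical, not the model's vocabulary).
VOCAB = ["<pad>", "|", "h", "e", "j", "d", "å"]
BLANK_ID = 0          # index treated as the CTC blank
WORD_DELIMITER = "|"  # token standing in for a space between words


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame token ids into text using the greedy CTC rule."""
    # 1. Merge runs of identical ids (a run counts as a single emission).
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove blanks, which only serve to separate genuine repeats.
    tokens = [vocab[i] for i in collapsed if i != blank_id]
    # 3. Turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


if __name__ == "__main__":
    # Frames: h h <pad> e j j | d <pad> å
    frames = [2, 2, 0, 3, 4, 4, 1, 5, 0, 6]
    print(ctc_greedy_decode(frames))  # hej då
```

Because repeats are merged before blanks are removed, a doubled letter survives only when a blank separates the two occurrences, which is why `[2, 0, 2]` decodes to `hh` while `[2, 2]` decodes to `h`.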

📚 Detailed documentation

Performance

Figure: Comparison

~~*The chart shows performance without the additional 20k steps of fine-tuning on Common Voice~~

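Training

The model was fine-tuned for 120,000 updates on NST + CommonVoice, then for a further 20,000 updates on CommonVoice alone. The extra CommonVoice fine-tuning somewhat hurts performance on the NST + CommonVoice test set and, unsurprisingly, improves it on the CommonVoice test set. Overall, though, it seems to perform better [citation needed].

Figure: WER during training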

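Evaluation metrics

Property	Details
Model type	Wav2vec 2.0 large VoxRex Swedish (C)
Training data	common_voice, NST_Swedish_ASR_Database, P4
Evaluation metric	Word error rate (WER)
Common Voice test set WER (no language model)	8.49%
Common Voice test set WER (4-gram language model)	7.37%
NST + Common Voice test set WER	2.5%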
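
WER is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below is a minimal illustration of that definition, plus a helper for the relative gain from the language model. The authors' evaluation script and any text normalization they applied are not described here, so this function may not reproduce the figures in the table exactly. The function names and example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edits needed to turn the reference words processed
    # so far into the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


def relative_reduction(baseline, improved):
    """Fractional drop from `baseline` to `improved`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up sentences, not taken from any test set: one word is missing.
    print(word_error_rate("jag vill ha kaffe", "jag vill kaffe"))  # 0.25
    # Relative gain from the 4-gram language model on Common Voice.
    print(f"{relative_reduction(8.49, 7.37):.1%}")  # 13.2%
```

Read this way, the drop from 8.49% to 7.37% WER is an absolute gain of 1.12 points, or roughly a 13% relative reduction in word errors.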

Citation

https://arxiv.org/abs/2205.03026

@misc{malmsten2022hearing,
      title={Hearing voices at the National Library -- a speech corpus and acoustic model for the Swedish language}, 
      author={Martin Malmsten and Chris Haffenden and Love Börjeson},
      year={2022},
      eprint={2205.03026},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}