wav2vec2-large-xlsr-53-es: an open-source speech recognition model for accurate Spanish transcription

Wav2vec2 Large Xlsr 53 Es

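Developed by pcuenq

A speech recognition model fine-tuned from Facebook's wav2vec2-large-xlsr-53 on the Spanish Common Voice dataset, with a test WER of 10.50%.

Speech recognition

Transformers

Language: Spanish | License: Apache-2.0 | Tags: #Spanish speech recognition, #Low WER, #XLSR fine-tuning

Downloads: 147

Released: 3/2/2022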

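Model overview

This is an automatic speech recognition (ASR) model optimized for Spanish. It converts Spanish speech into text.

Model features

Low word error rate

Reaches a WER of 10.50% on the Common Voice Spanish test set.

Diacritics preserved

Spanish diacritics are kept in the output, which preserves the meaning of the transcribed words.

No language model required

Can be used directly, without an additional language model.

Multi-stage training

Trained in several stages, with the schedule adjusted between stages to improve performance step by step.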

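Model capabilities

Spanish speech recognition

16 kHz audio processing

Batch speech-to-text

Use cases

Speech transcription

Spanish speech to text

Converts Spanish speech content into text.

WER of about 10.5%, i.e. roughly 89.5% word-level accuracy

Voice assistants

Spanish voice command recognition

Serves as the base recognition component of a Spanish voice assistant.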

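🚀 Spanish Wav2Vec2-Large-XLSR-53 model

This project fine-tunes facebook/wav2vec2-large-xlsr-53 on Spanish using the Common Voice dataset. When using the model, make sure the speech input is sampled at 16 kHz.

🚀 Quick start

The model can be used directly (without a language model) as follows: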
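To illustrate what the 16 kHz requirement means in practice, here is a minimal pure-Python sketch that converts a mono signal from another sample rate down to 16 kHz by linear interpolation. The function name `resample_linear` is hypothetical, and the method is a deliberate simplification: it applies no anti-aliasing filter, so real audio is better handled by a proper resampler such as the torchaudio or librosa calls used elsewhere on this page.

```python
def resample_linear(samples, orig_sr, target_sr=16_000):
    """Resample a mono signal by linear interpolation (illustration only)."""
    if orig_sr == target_sr:
        return list(samples)
    if not samples:
        return []
    # Number of output samples that cover the same duration.
    n_out = int(len(samples) * target_sr / orig_sr)
    # Distance between output samples, measured in input samples.
    step = orig_sr / target_sr
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        # Weighted average of the two neighbouring input samples.
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


# One second of silence at 48 kHz becomes 16,000 samples at 16 kHz.
one_second = [0.0] * 48_000
print(len(resample_linear(one_second, 48_000)))
```

Common Voice clips are typically stored at 48 kHz, which is why the snippets on this page resample from 48,000 to 16,000.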

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "es", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
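Decoding without a language model amounts to greedy CTC decoding: take the most likely token for each audio frame, merge consecutive repeats, and drop the blank token. The sketch below shows that collapse step in plain Python. The toy vocabulary, the frame sequence, and the function name `ctc_greedy_decode` are made up for illustration; the real vocabulary and blank id come from the model's processor.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse per-frame argmax ids into text (greedy CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


# Hypothetical vocabulary; id 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "h", 2: "o", 3: "l", 4: "a"}
frames = [1, 1, 0, 2, 2, 0, 3, 0, 4, 4]
print(ctc_greedy_decode(frames, vocab))  # hola
```

A blank between two identical ids keeps both characters, which is how doubled letters such as "ll" survive the collapse.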

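✨ Key features

Fine-tuned from the large pre-trained model facebook/wav2vec2-large-xlsr-53 for Spanish speech recognition.
The Spanish portion of Common Voice was cleaned by removing a large number of non-Spanish characters while keeping Spanish diacritics, trading a slightly higher WER for transcriptions that preserve meaning.
Training parameters and strategy were adjusted over several rounds to reduce the WER (word error rate), ending at a test WER of about 10.5%.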

💻 Usage examples

Advanced usage

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "es", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

def replace_diacritics(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ì', 'í', sentence)
    sentence = re.sub('ù', 'ú', sentence)
    sentence = re.sub('ò', 'ó', sentence)
    sentence = re.sub('à', 'á', sentence)
    batch["sentence"] = sentence
    return batch

def replace_additional(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ã', 'a', sentence)   # Portuguese, as in São Paulo
    sentence = re.sub('ō', 'o', sentence)   # Japanese
    sentence = re.sub('ê', 'e', sentence)   # Português
    batch["sentence"] = sentence
    return batch

## Audio pre-processing

# I tried to perform the resampling using a `torchaudio` `Resampler` transform,
# but found that the process deadlocked when using multiple processes.
# Perhaps my torchaudio is using the wrong sox library under the hood, I'm not sure.
# Fortunately, `librosa` seems to work fine, so that's what I'll use for now.

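import librosa
def speech_file_to_array_fn(batch):
    speech_array, sample_rate = torchaudio.load(batch["path"])
    # Keyword arguments are required by recent librosa releases and also work on older ones.
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sample_rate, target_sr=16_000)
    return batch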

# One-pass mapping function

# Text transformation and audio resampling
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = replace_diacritics(batch)
    batch = replace_additional(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of CPUs or None
num_proc = 16

test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER Metric computation
# `wer.compute` crashes in my computer with more than ~10000 samples.
# Until I confirm in a different one, I created a "chunked" version of the computation.
# It gives the same results as `wer.compute` for smaller datasets.

import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    # Without a chunk size, let jiwer process the whole dataset at once.
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    # Running totals of hits, substitutions, deletions and insertions.
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    # WER = errors / reference words, using the counts summed over all chunks.
    return float(S + D + I) / float(H + S + D)

print("WER: {:.2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
#print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
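The chunked computation above is valid because corpus-level WER is a ratio of summed counts, (S + D + I) / (H + S + D), so the counts can be accumulated chunk by chunk and divided once at the end. The following dependency-free sketch illustrates this with a word-level edit distance. The helper names `edit_distance` and `error_counts` and the three example sentences are made up for illustration and are not part of the evaluation script.

```python
def edit_distance(ref_words, hyp_words):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_counts(references, hypotheses):
    """Return (word errors, reference words) summed over sentence pairs."""
    errors, ref_len = 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()
        errors += edit_distance(ref_words, hyp.split())
        ref_len += len(ref_words)
    return errors, ref_len


refs = ["el gato come pescado", "hola mundo", "buenos días a todos"]
hyps = ["el gato come pescado", "hola mundos", "buenos día todos"]

# Counts over the whole set versus counts summed over two chunks.
full = error_counts(refs, hyps)
first = error_counts(refs[:2], hyps[:2])
second = error_counts(refs[2:], hyps[2:])
chunked = (first[0] + second[0], first[1] + second[1])

print(full, chunked)       # (3, 10) (3, 10)
print(full[0] / full[1])   # 0.3
```

Averaging per-chunk WER values instead of summing the counts would weight short and long chunks equally and generally give a different number.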

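📚 Detailed documentation

Text processing

Even after delimiters and punctuation are removed, the Spanish Common Voice dataset still contains many non-Spanish characters. Most of these extraneous characters were removed or replaced, while all Spanish diacritics were kept. Using only unaccented characters would probably give a better WER score, but keeping the diacritics preserves the meaning of the text more faithfully. The exact rules are the ones implemented in the evaluation script.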
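As a worked example of these rules, the sketch below applies a condensed version of them to single sentences: punctuation is dropped, text is lowercased, grave accents are mapped to the Spanish acute forms, and a few foreign letters are folded to plain vowels, while Spanish characters such as ó and ñ are left untouched. The `normalize` function and the shortened punctuation list are illustrative assumptions; the evaluation script remains the reference for the full character set.

```python
# Subset of the punctuation removed by the evaluation script.
REMOVE = ",¿?.¡!-;:\"“%‘”…’'`´—"

# Grave accents become Spanish acute accents; foreign letters are folded.
TABLE = str.maketrans("ìùòàãōê", "íúóáaoe", REMOVE)


def normalize(sentence):
    """Lowercase, strip punctuation and normalize non-Spanish letters."""
    return sentence.lower().translate(TABLE)


print(normalize("¿Dónde està São Paulo?"))  # dónde está sao paulo
print(normalize("¡Mañana, más!"))           # mañana más
```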

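Training process

Training used the Common Voice train and validation datasets. To observe progress earlier and adjust the strategy in time, train and validation were initially divided into splits of 10% each. The stages were as follows:

Trained for 30 epochs on the first split only, with parameters similar to those in Patrick's demo notebook, a batch size of 24 and 2 gradient accumulation steps. WER on the full test set was about 16.3%.
Trained for 3 epochs on each of the remaining 9 splits, with a faster warmup of 75 steps. WER was about 11.7%.
Trained for 3 epochs on each of the 10 splits, with a smaller learning rate of 1e-4 and a 75-step warmup. The resulting model had a WER of about 11.7%.
Trained on the full dataset with a cosine schedule with hard restarts, a reference learning rate of 3e-5, 10 epochs and no warmup. The final WER was about 10.5%.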
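For readers unfamiliar with the schedule used in the last stage, the sketch below computes the learning rate of a cosine schedule with hard restarts and no warmup: within each cycle the rate decays from the reference value towards zero along a half cosine, then jumps back to the reference value at the start of the next cycle. The function name and the choice of `num_cycles=10` are assumptions made for illustration (the text above only states the reference learning rate, the epoch count, and the absence of warmup). In practice a library scheduler such as `get_cosine_with_hard_restarts_schedule_with_warmup` from transformers would normally be used instead.

```python
import math


def cosine_hard_restarts_lr(step, total_steps, base_lr=3e-5, num_cycles=10):
    """Learning rate at `step` for a cosine schedule with hard restarts."""
    if step >= total_steps:
        return 0.0
    progress = step / total_steps
    # Position inside the current cycle, in [0, 1).
    cycle_pos = (num_cycles * progress) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


# With 1000 steps and 10 cycles, each cycle lasts 100 steps.
for step in (0, 50, 99, 100):
    print(step, cosine_hard_restarts_lr(step, total_steps=1000))
```

The rate at step 99 is close to zero, and at step 100 it is back at the reference value, which is the "hard restart".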

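Other experiments

Starting from the same fine-tuned model, a constant learning rate of 1e-4 was compared against a linear schedule with warmup. The linear schedule performed better (WER of 11.85% versus 12.72%).
Using the Spanish model to improve a Basque model was attempted, without success.
Label smoothing did not help.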

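Issues and technical challenges

The Datasets abstraction is built on memory-mapped files and can handle datasets of arbitrary size, but its limitations and trade-offs need to be understood. Caching is convenient but consumes disk space quickly, so it helps to know how cache files are stored and to save data manually when necessary.
There was a noticeable delay before training started. The cause was identified and a solution was discussed.
wer.compute crashes on large datasets, so a chunked version of the computation was implemented.
torchaudio deadlocks when used with multiple processes, so librosa is currently used for resampling.
When num_proc is used in a notebook, progress bars are not displayed, possibly because of a permissions issue.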