bp400-xlsr: open-source speech recognition model for Brazilian Portuguese

Bp400 Xlsr

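Developed by lgris

A Wav2vec 2.0 speech recognition model fine-tuned on Brazilian Portuguese datasets for automatic speech recognition (ASR) of Brazilian Portuguese.

Speech recognition

Transformers

Open-source license: Apache-2.0 #Brazilian Portuguese speech recognition #Multi-dataset training #Low WER

Downloads: 55

Released: 3/2/2022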

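Model Overview

This model is an automatic speech recognition (ASR) system for Brazilian Portuguese, based on the Wav2vec 2.0 architecture and fine-tuned on several Brazilian Portuguese datasets.

Model Features

Multi-dataset training

The model combines seven Brazilian Portuguese datasets, including CETUC and Common Voice, for more than 400 hours of training data in total.

Language model support

Decoding with a 4-gram language model improves accuracy further: the average WER drops from 12.4% to 10.5%.

High accuracy

With the 4-gram language model, WER is 3.0% on the CETUC test set and 9.6% on the Common Voice test set.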

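Model Capabilities

Brazilian Portuguese speech recognition

Audio transcription

Speech-to-text

Use Cases

Speech transcription

Brazilian Portuguese speech transcription

Converts Brazilian Portuguese speech into text.

Reaches 3.0% WER on the CETUC dataset when decoding with the 4-gram language model.

Voice assistants

Brazilian Portuguese voice command recognition

Recognizes spoken commands in Brazilian Portuguese voice assistant systems.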

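🚀 bp400-xlsr: a Wav2vec 2.0 model based on Brazilian Portuguese (BP) datasets

This project presents a Wav2vec model fine-tuned for Brazilian Portuguese using the following datasets: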

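CETUC: contains approximately 145 hours of Brazilian Portuguese speech from 50 male and 50 female speakers, each reading approximately 1,000 phonetically balanced sentences selected from the CETEN-Folha corpus.
Common Voice 7.0: a project started by the Mozilla Foundation to create open datasets in many languages. Volunteers donate and validate speech through the official site.
[Lapsbm](https://github.com/falabrasil/gitlab-resources): "Falabrasil - UFPA" is a dataset used by the Fala Brasil group to benchmark automatic speech recognition (ASR) systems for Brazilian Portuguese. It contains 35 speakers (10 of them women), each reading 20 unique sentences, for a total of 700 Brazilian Portuguese utterances. The audio was recorded at 22.05 kHz without environment control.
Multilingual Librispeech (MLS): a large-scale multilingual dataset based on public-domain audiobook recordings such as LibriVox. It contains 6,000 hours of transcribed data in total across many languages. The Portuguese set used in this project (mostly the Brazilian variant) has approximately 284 hours of speech, from 55 audiobooks read by 62 speakers.
Multilingual TEDx: audio recordings of TEDx talks in 8 source languages. The Portuguese set (mostly the Brazilian Portuguese variant) contains 164 hours of transcribed speech.
Sidney (SID): contains 5,777 utterances recorded by 72 speakers (20 of them women) aged 17 to 59. The dataset also includes each speaker's place of birth, age, gender, education, and occupation.
VoxForge: a project that aims to build open datasets for acoustic models. The corpus contains approximately 100 speakers and 4,130 Brazilian Portuguese utterances, with sampling rates ranging from 16 kHz to 44.1 kHz.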

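These datasets were combined to build a larger Brazilian Portuguese dataset. All data was used for training except the Common Voice dev and test sets, which were used for validation and testing, respectively. We also created test sets for all the gathered datasets.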

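| Dataset | Train | Valid | Test |
| --- | --- | --- | --- |
| CETUC | 93.9h | -- | 5.4h |
| Common Voice | 37.6h | 8.9h | 9.5h |
| LaPS BM | 0.8h | -- | 0.1h |
| MLS | 161.0h | -- | 3.7h |
| Multilingual TEDx (Portuguese) | 144.2h | -- | 1.8h |
| SID | 5.0h | -- | 1.0h |
| VoxForge | 2.8h | -- | 0.1h |
| Total | 437.2h | 8.9h | 21.6h |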

The original model was fine-tuned using fairseq. This project uses a converted version of the original model. The original fairseq model is available [here](https://drive.google.com/drive/folders/1eRUExXRF2XK8JxUjIzbLBkLa5wuR3nig?usp=sharing).

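Summary of model metrics (WER)

| Model | CETUC | CV | LaPS | MLS | SID | TEDx | VF | AVG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| bp_400 (demonstration below) | 0.052 | 0.140 | 0.074 | 0.117 | 0.121 | 0.245 | 0.118 | 0.124 |
| bp_400 + 3-gram | 0.033 | 0.095 | 0.046 | 0.123 | 0.112 | 0.212 | 0.123 | 0.106 |
| bp_400 + 4-gram (demonstration below) | 0.030 | 0.096 | 0.043 | 0.106 | 0.118 | 0.229 | 0.117 | 0.105 |
| bp_400 + 5-gram | 0.033 | 0.094 | 0.043 | 0.123 | 0.111 | 0.210 | 0.123 | 0.105 |
| bp_400 + Transf. | 0.032 | 0.092 | 0.036 | 0.130 | 0.115 | 0.215 | 0.125 | 0.106 |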
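
As a quick check of how the AVG column relates to the per-dataset columns, the sketch below assumes that AVG is the unweighted mean of the seven test-set WERs (the source does not state this) and derives the relative WER reduction from adding the 4-gram language model. The numbers are copied from the table above; the function and variable names are illustrative only. Because the table values are already rounded, the recomputed means should be expected to match the AVG column only to within about 0.001.

```python
# Per-dataset WERs copied from the summary table
# (order: CETUC, CV, LaPS, MLS, SID, TEDx, VF).
bp_400 = [0.052, 0.140, 0.074, 0.117, 0.121, 0.245, 0.118]
bp_400_4gram = [0.030, 0.096, 0.043, 0.106, 0.118, 0.229, 0.117]


def mean(values):
    """Unweighted arithmetic mean (assumed to be how AVG is computed)."""
    return sum(values) / len(values)


def relative_reduction(baseline, improved):
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


avg_base = mean(bp_400)
avg_lm = mean(bp_400_4gram)
print(f"baseline AVG: {avg_base:.3f}")
print(f"4-gram AVG:   {avg_lm:.3f}")
print(f"relative WER reduction: {relative_reduction(avg_base, avg_lm):.1%}")
```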

Transcription examples

| Reference text | Transcription |
| --- | --- |
| alguém sabe a que horas começa o jantar | alguém sabe a que horas começo jantar |
| lila covas ainda não sabe o que vai fazer no fundo | lilacovas ainda não sabe o que vai fazer no fundo |
| que tal um pouco desse bom spaghetti | quetá um pouco deste bom ispaguete |
| hong kong em cantonês significa porto perfumado | rongkong en cantones significa porto perfumado |
| vamos hackear esse problema | vamos rackar esse problema |
| apenas a poucos metros há uma estação de ônibus | apenas ha poucos metros á uma estação de ônibus |
| relâmpago e trovão sempre andam juntos | relampagotrevão sempre andam juntos |
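
WER, the metric reported throughout this page, is the word-level edit distance between the reference and the transcription divided by the number of reference words. The sketch below is a minimal pure-Python version for illustration only: the evaluation code on this page uses jiwer, and `word_error_rate` is a hypothetical helper name, not part of any library. It is applied to the first example pair above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i - 1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


reference = "alguém sabe a que horas começa o jantar"
hypothesis = "alguém sabe a que horas começo jantar"
# One substitution (começa -> começo) and one deletion (o) over 8 reference words
print(word_error_rate(reference, hypothesis))  # 0.25
```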

🚀 Quick Start

Model usage example

MODEL_NAME = "lgris/bp400-xlsr"

Import dependencies

%%capture
!pip install torch==1.8.2+cu111 torchvision==0.9.2+cu111 torchaudio===0.8.2 -f https://download.pytorch.org/whl/lts/1.8/torch_lts.html
!pip install datasets
!pip install jiwer
!pip install transformers
!pip install soundfile
!pip install pyctcdecode
!pip install https://github.com/kpu/kenlm/archive/master.zip

import jiwer
import torchaudio
from datasets import load_dataset
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)
from pyctcdecode import build_ctcdecoder
import torch
import re
import sys

Helper functions

chars_to_ignore_regex = r'[,?.!;:"]'

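def map_to_array(batch):
    # Load the waveform; the file's own sampling rate is discarded
    speech, _ = torchaudio.load(batch["path"])
    batch["speech"] = speech.squeeze(0).numpy()
    # The test audio is assumed to already be sampled at 16 kHz
    batch["sampling_rate"] = 16_000
    # Strip punctuation, lowercase, and normalize the typographic apostrophe
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower().replace("’", "'")
    batch["target"] = batch["sentence"]
    return batch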

def calc_metrics(truths, hypos):
    # Average the per-utterance WER, MER, and WIL over all scored pairs
    wers = []
    mers = []
    wils = []
    for t, h in zip(truths, hypos):
        try:
            wers.append(jiwer.wer(t, h))
            mers.append(jiwer.mer(t, h))
            wils.append(jiwer.wil(t, h))
        except Exception:  # skip pairs jiwer cannot score, e.g. an empty string
            pass
    wer = sum(wers)/len(wers)
    mer = sum(mers)/len(mers)
    wil = sum(wils)/len(wils)
    return wer, mer, wil

def load_data(dataset):
    data_files = {'test': f'{dataset}/test.csv'}
    dataset = load_dataset('csv', data_files=data_files)["test"]
    return dataset.map(map_to_array)
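
To make the text normalization in `map_to_array` concrete, the sketch below applies the same three steps (strip the listed punctuation, lowercase, replace the typographic apostrophe) to a sample sentence. The `normalize` helper and the sample sentence are made up for illustration; only the character class and the chain of string operations come from the code above.

```python
import re

# Same character class as chars_to_ignore_regex above, written as a raw string.
CHARS_TO_IGNORE = r'[,?.!;:"]'


def normalize(sentence: str) -> str:
    """Strip punctuation, lowercase, and normalize the apostrophe."""
    return re.sub(CHARS_TO_IGNORE, '', sentence).lower().replace("’", "'")


print(normalize('Olá, tudo bem? "Sim": copo d’água!'))
# olá tudo bem sim copo d'água
```

Reference and hypothesis must go through the same normalization for the WER to be meaningful, since punctuation or case differences would otherwise count as word errors.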

Model definition

class STT:
    """Wav2Vec2 CTC model wrapper with optional n-gram language model decoding."""

    def __init__(self,
                 model_name,
                 device='cuda' if torch.cuda.is_available() else 'cpu',
                 lm=None):
        self.model_name = model_name
        self.model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
        self.processor = Wav2Vec2Processor.from_pretrained(model_name)
        self.vocab_dict = self.processor.tokenizer.get_vocab()
        # Order tokens by id so each label lines up with its logit column
        self.sorted_dict = {
            k.lower(): v for k, v in sorted(self.vocab_dict.items(),
                                            key=lambda item: item[1])
        }
        self.device = device
        self.lm = lm  # path to a KenLM .arpa file, or None for greedy decoding
        if self.lm:
            self.lm_decoder = build_ctcdecoder(
                list(self.sorted_dict.keys()),
                self.lm
            )

    def batch_predict(self, batch):
        # Pad the batch to a common length and build the attention mask
        features = self.processor(batch["speech"],
                                  sampling_rate=batch["sampling_rate"][0],
                                  padding=True,
                                  return_tensors="pt")
        input_values = features.input_values.to(self.device)
        attention_mask = features.attention_mask.to(self.device)
        with torch.no_grad():
            logits = self.model(input_values, attention_mask=attention_mask).logits
        if self.lm:
            # Beam search with the language model, one utterance at a time
            logits = logits.cpu().numpy()
            batch["predicted"] = []
            for sample_logits in logits:
                batch["predicted"].append(self.lm_decoder.decode(sample_logits))
        else:
            # Greedy decoding: most likely token per frame
            pred_ids = torch.argmax(logits, dim=-1)
            batch["predicted"] = self.processor.batch_decode(pred_ids)
        return batch
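
When no language model is given, `batch_predict` takes the most likely token per frame and leaves the rest to `processor.batch_decode`. The sketch below shows the standard greedy CTC collapse that this step relies on: merge consecutive repeats, then drop the blank token. It is a toy, library-free illustration. The vocabulary, the frame sequence, and the `greedy_ctc_decode` helper are hypothetical, and treating `<pad>` as the blank and `|` as the word delimiter is an assumption based on the usual Wav2Vec2 tokenizer defaults.

```python
from itertools import groupby

# Hypothetical toy vocabulary; index 0 plays the role of the CTC blank.
VOCAB = ["<pad>", "|", "o", "l", "a"]
BLANK_ID = 0
WORD_DELIMITER = "|"


def greedy_ctc_decode(frame_ids):
    """Collapse repeated ids, remove blanks, and map ids to text."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [VOCAB[i] for i in collapsed if i != BLANK_ID]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would yield for one utterance.
frames = [2, 2, 0, 3, 3, 3, 0, 4, 1, 1, 2, 0, 3, 4, 4]
print(greedy_ctc_decode(frames))  # ola ola
```

The blank matters for doubled letters: the frames `[2, 0, 2]` decode to `oo`, whereas `[2, 2]` collapses to a single `o`.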

Download the datasets

%%capture
!gdown --id 1HFECzIizf-bmkQRLiQD0QVqcGtOG5upI
!mkdir bp_dataset
!unzip bp_dataset -d bp_dataset/

Tests

Baseline tests (no language model)

stt = STT(MODEL_NAME)

CETUC test

ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

Output:

CETUC WER: 0.05159104708285062

Common Voice test

ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

Output:

CV WER: 0.14031426198658084

LaPS test

ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

Output:

Laps WER: 0.07432133838383838

MLS test

ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

Output:

MLS WER: 0.11678793514817509

SID test

ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)

Output:

Sid WER: 0.12152357273433984

TEDx test

ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)

Output:

TEDx WER: 0.24666815906766504

VoxForge test

ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)

Output:

VoxForge WER: 0.11873106060606062

Tests with a language model (LM)

!rm -rf ~/.cache
!gdown --id 1GJIKseP5ZkTbllQVgOL98R4yYAcIySFP  # language model trained on Wikipedia
stt = STT(MODEL_NAME, lm='pt-BR-wiki.word.4-gram.arpa')
# !gdown --id 1dLFldy7eguPtyJj5OAlI4Emnx0BpFywg  # language model trained on Brazilian Portuguese data
# stt = STT(MODEL_NAME, lm='pt-BR.word.4-gram.arpa')
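
To give some intuition for why an n-gram language model can lower WER, the sketch below shows a simplified form of shallow fusion: each candidate transcription gets its acoustic score plus a weighted language model score and a per-word bonus. This is a toy model of the idea, not pyctcdecode's actual scoring code. The `fused_score` and `pick_best` helpers, the `alpha` and `beta` values, and all log-scores are invented for illustration and are not outputs of bp400-xlsr or of the 4-gram model.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha, beta):
    """Acoustic score plus weighted LM score and a per-word bonus."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


# Invented log-scores for two candidate transcriptions (not model outputs).
candidates = {
    "começo jantar": {"acoustic": -4.0, "lm": -14.0},
    "começa o jantar": {"acoustic": -4.6, "lm": -9.0},
}


def pick_best(candidates, alpha, beta):
    """Return the candidate text with the highest fused score."""
    return max(
        candidates,
        key=lambda text: fused_score(
            candidates[text]["acoustic"],
            candidates[text]["lm"],
            len(text.split()),
            alpha,
            beta,
        ),
    )


# With the LM switched off, the acoustically stronger candidate wins.
print(pick_best(candidates, alpha=0.0, beta=0.0))  # começo jantar
# With the LM weighted in, the more fluent word sequence wins.
print(pick_best(candidates, alpha=0.5, beta=1.5))  # começa o jantar
```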

CETUC test with LM

ds = load_data('cetuc_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CETUC WER:", wer)

Output:

CETUC WER: 0.030266462438593742

Common Voice test with LM

ds = load_data('commonvoice_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("CV WER:", wer)

Output:

CV WER: 0.09577710237417715

LaPS test with LM

ds = load_data('lapsbm_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Laps WER:", wer)

Output:

Laps WER: 0.043617424242424235

MLS test with LM

ds = load_data('mls_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("MLS WER:", wer)

Output:

MLS WER: 0.10642133314350002

SID test with LM

ds = load_data('sid_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("Sid WER:", wer)

Output:

Sid WER: 0.11839021001747055

TEDx test with LM

ds = load_data('tedx_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("TEDx WER:", wer)

Output:

TEDx WER: 0.22929952467810416

VoxForge test with LM

ds = load_data('voxforge_dataset')
result = ds.map(stt.batch_predict, batched=True, batch_size=8)
wer, mer, wil = calc_metrics(result["sentence"], result["predicted"])
print("VoxForge WER:", wer)

Output:

VoxForge WER: 0.11716314935064935

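Model information

| Property | Details |
| --- | --- |
| Model type | bp400-xlsr: a Wav2vec 2.0 model based on Brazilian Portuguese (BP) datasets |
| Training data | CETUC, Common Voice 7.0, Lapsbm, Multilingual Librispeech (MLS), Multilingual TEDx, Sidney (SID), VoxForge |
| License | Apache-2.0 |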