wav2vec2-xls-r-1b-tevr: an open-source German speech recognition model with a low error rate

Wav2vec2 Xls R 1b Tevr

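Developed by fxtentacle

A German speech recognition model built on the wav2vec 2.0 XLS-R 1B architecture with TEVR (Token Entropy Variance Reduction). Combined with a 5-gram language model, it reaches a word error rate of 3.64% on the Common Voice German test set.

Speech Recognition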

Transformers

Language: German. License: Apache-2.0. Tags: German speech recognition, TEVR enhancement, very low word error rate

Downloads: 311

Released: 6/2/2022

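Model overview

This model is a high-accuracy German automatic speech recognition system. It uses TEVR to optimize token generation, which significantly improves recognition accuracy.

Model features

TEVR enhancement

Improves recognition accuracy through Token Entropy Variance Reduction.

Language model integration

Combines the acoustic model with a 5-gram KenLM language model, which significantly lowers the error rate.

Optimized for German

Tuned specifically for German speech, including German-specific characters and pronunciation.

Model capabilities

German speech-to-text

High-accuracy speech recognition

Real-time speech processing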

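Use cases

Speech transcription

German meeting minutes

Automatically converts recordings of German meetings into written records.

Word error rate as low as 3.64% (measured on the Common Voice German test set)

Voice assistants

Provides high-accuracy speech recognition for German voice assistants.

Accessibility

Live captioning

Generates live captions for German video content.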

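🚀 German speech recognition model

This project provides a fully trained German speech recognition pipeline that combines an acoustic model with a language model. It achieves a word error rate of 3.64% on the CommonVoice German dataset.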

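📚 Documentation

Overview

This folder contains a fully trained German speech recognition pipeline. It consists of an acoustic model that uses the new wav2vec 2.0 XLS-R 1B TEVR architecture and a 5-gram KenLM language model. For an explanation of the TEVR enhancements and their motivation, see our paper: TEVR: Improving Speech Recognition by Token Entropy Variance Reduction.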
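The pipeline decodes the acoustic model's CTC output with a beam search backed by the 5-gram language model. For contrast, the sketch below shows the simplest decoder that uses no language model at all: greedy CTC decoding, which picks the best token per frame, merges repeats, and drops blanks. The vocabulary and frame scores are invented for illustration and do not reflect the model's TEVR token set; greedy_ctc_decode and one_hot are hypothetical helper names.

```python
from itertools import groupby

# Hypothetical toy vocabulary (an assumption); index 0 is the CTC blank.
VOCAB = ["<blank>", "h", "a", "l", "o", " "]

def greedy_ctc_decode(frame_scores):
    """Pick the best token per frame, merge repeats, then drop blanks."""
    best_ids = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # groupby merges runs of identical consecutive token ids.
    collapsed = [token_id for token_id, _ in groupby(best_ids)]
    return "".join(VOCAB[i] for i in collapsed if i != 0)

def one_hot(index, size=len(VOCAB)):
    """Build a fake score vector in which one token clearly wins."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Frames spell: h h a <blank> l l <blank> l o
frames = [one_hot(i) for i in (1, 1, 2, 0, 3, 3, 0, 3, 4)]
print(greedy_ctc_decode(frames))  # hallo
```

The blank between the two runs of "l" is what lets CTC emit a doubled letter. A language-model-backed beam search keeps several candidate sequences instead of a single best path and rescores them with n-gram probabilities.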

As of June 2022, this pipeline achieves a very competitive word error rate (WER) of 3.64% on the CommonVoice German dataset. The character error rate (CER) is 1.54%.
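Both metrics are edit distances divided by the reference length, counted over words for WER and over characters for CER. The following is a minimal pure-Python sketch for a single utterance, assuming whitespace tokenization; the function names and the sample sentences are made up for illustration. Corpus-level figures such as the ones above are normally aggregated over all utterances rather than averaged per sentence.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)

reference = "das ist ein kleiner test"
hypothesis = "das ist ein kleine test"
print(f"WER {wer(reference, hypothesis) * 100:.2f} %")  # 1 of 5 words wrong
print(f"CER {cer(reference, hypothesis) * 100:.2f} %")  # 1 of 24 characters wrong
```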

Citation

If you use this automatic speech recognition (ASR) pipeline for research, please cite:

@misc{https://doi.org/10.48550/arxiv.2206.12693,
  doi = {10.48550/ARXIV.2206.12693},
  url = {https://arxiv.org/abs/2206.12693},
  author = {Krabbenhöft, Hajo Nils and Barth, Erhardt},  
  keywords = {Computation and Language (cs.CL), Sound (cs.SD), Audio and Speech Processing (eess.AS), FOS: Computer and information sciences, FOS: Computer and information sciences, FOS: Electrical engineering, electronic engineering, information engineering, FOS: Electrical engineering, electronic engineering, information engineering, F.2.1; I.2.6; I.2.7},  
  title = {TEVR: Improving Speech Recognition by Token Entropy Variance Reduction},  
  publisher = {arXiv},  
  year = {2022}, 
  copyright = {Creative Commons Attribution 4.0 International}
}

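TEVR tokenizer creation / testing

See https://huggingface.co/fxtentacle/tevr-token-entropy-predictor-de for:

The ByT5 model we trained to compute the entropies in the paper
A Jupyter Notebook that generates a TEVR tokenizer from a text corpus
A Jupyter Notebook that generates the illustrations in the paper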
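As a rough intuition for the name Token Entropy Variance Reduction, the sketch below compares how unevenly entropy is spread across tokens under two tokenizations of one word. It is only an illustration under stated assumptions and not the procedure from the notebook: the per-character entropy values are invented (in the paper they come from the ByT5 model), a token's entropy is assumed to be the sum over its characters, and token_entropies is a hypothetical helper.

```python
from statistics import pvariance

# Invented per-character entropy values (in bits) for the word "sprechen".
chars = list("sprechen")
char_entropy = [3.1, 2.4, 0.9, 1.6, 0.4, 0.2, 0.7, 0.3]

def token_entropies(token_lengths):
    """Sum the character entropies covered by each token (an assumption)."""
    assert sum(token_lengths) == len(char_entropy)
    sums, start = [], 0
    for length in token_lengths:
        sums.append(sum(char_entropy[start:start + length]))
        start += length
    return sums

per_char = token_entropies([1] * len(chars))  # s|p|r|e|c|h|e|n
grouped = token_entropies([1, 1, 2, 4])       # s|p|re|chen

print("variance, one token per character:", round(pvariance(per_char), 3))
print("variance, grouped tokens:", round(pvariance(grouped), 3))
```

With these made-up numbers, merging the easily predicted characters into longer tokens gives a more even entropy per token, which is the kind of effect the name suggests.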

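Evaluation

To evaluate this pipeline yourself and/or on your own data, see the HF Eval Script.ipynb Jupyter Notebook, or use the following Python script (the lines starting with ! are Jupyter shell commands):

💻 Usage example

Basic usage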

!pip install --quiet --root-user-action=ignore --upgrade pip
!pip install --quiet --root-user-action=ignore "datasets>=1.18.3" "transformers==4.11.3" librosa jiwer huggingface_hub  
!pip install --quiet --root-user-action=ignore https://github.com/kpu/kenlm/archive/master.zip pyctcdecode
!pip install --quiet --root-user-action=ignore --upgrade transformers
!pip install --quiet --root-user-action=ignore torch_audiomentations audiomentations

from datasets import load_dataset, Audio, load_metric
from transformers import AutoModelForCTC, Wav2Vec2ProcessorWithLM
import torchaudio.transforms as T
import torch
import unicodedata
import numpy as np
import re

# load testing dataset 
testing_dataset = load_dataset("common_voice", "de", split="test")

# map punctuation, symbols, and separators (Unicode categories P, S, Z) to a space, and delete apostrophes
allchars = list(set([c for t in testing_dataset['sentence'] for c in list(t)]))
map_to_space = [c for c in allchars if unicodedata.category(c)[0] in 'PSZ' and c not in 'ʻ-']
replacements = ''.maketrans(''.join(map_to_space), ''.join(' ' for i in range(len(map_to_space))), '\'ʻ')

def text_fix(text):
    # change ß to ss
    text = text.replace('ß','ss')
    # convert dash to space and remove double-space
    text = text.replace('-',' ').replace('  ',' ').replace('  ',' ')
    # make lowercase
    text = text.lower()
    # apply the character mapping built above (punctuation to space, apostrophes removed)
    text = text.translate(replacements).strip()
    # for easier comparison to Zimmermeister, replace unrepresentable characters with ?
    text = re.sub("[âşěýňעảנźțãòàǔł̇æồאắîשðșęūāñë生בøúıśžçćńřğ]+","?",text)
    # remove multiple spaces (again)
    text = ' '.join([w for w in text.split(' ') if w != ''])
    return text

# load model
model = AutoModelForCTC.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")
model.to('cuda')
# load processor
class HajoProcessor(Wav2Vec2ProcessorWithLM):
    @staticmethod
    def get_missing_alphabet_tokens(decoder, tokenizer):
        return []
processor = HajoProcessor.from_pretrained("fxtentacle/wav2vec2-xls-r-1b-tevr")

# this function will be called for each WAV file
def predict_single_audio(batch, image=False):    
    audio = batch['audio']['array']
    # resample, if needed
    if batch['audio']['sampling_rate'] != 16000:
        audio = T.Resample(orig_freq=batch['audio']['sampling_rate'], new_freq=16000)(torch.from_numpy(audio)).numpy()
    # normalize
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    # ask HF processor to prepare audio for GPU eval
    input_values = processor(audio, return_tensors="pt", sampling_rate=16_000).input_values
    # call model on GPU
    with torch.no_grad():
        logits = model(input_values.to('cuda')).logits.cpu().numpy()[0]
    # ask HF processor to decode logits
    decoded = processor.decode(logits, beam_width=500)
    # return as dictionary
    return { 'groundtruth': text_fix(batch['sentence']), 'prediction': decoded.text }

# process all audio files
all_predictions = testing_dataset.map(predict_single_audio, remove_columns=testing_dataset.column_names)

# print results
print('WER', load_metric("wer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')
print('CER', load_metric("cer").compute(predictions=all_predictions['prediction'], references=all_predictions['groundtruth'])*100.0, '%')

Running the script above produces the following output:

WER 3.6433399042523233 %
CER 1.5398893560981173 %
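These figures depend on how the reference transcripts are normalized before scoring, which the script does in text_fix. The sketch below is a reduced, stand-alone variant for trying the normalization on your own sentences. It is a simplification under stated assumptions rather than the evaluation code itself: normalize_reference is a hypothetical name, it checks the Unicode category of each character instead of scanning the test set for characters, and it omits the step that replaces unrepresentable characters with ?.

```python
import unicodedata

def normalize_reference(text: str) -> str:
    """Reduced variant of text_fix that needs no dataset-wide character scan."""
    # German sharp s becomes "ss"; hyphens become word separators.
    text = text.replace("ß", "ss").replace("-", " ").lower()
    out = []
    for ch in text:
        if ch in "'ʻ":
            continue  # apostrophes are deleted, not replaced by a space
        # Punctuation (P*), symbols (S*), and separators (Z*) become spaces.
        out.append(" " if unicodedata.category(ch)[0] in "PSZ" else ch)
    # Collapse runs of whitespace and trim both ends.
    return " ".join("".join(out).split())

sample = "Die Straße ist, wie man's kennt, Nord-Süd ausgerichtet!"
print(normalize_reference(sample))
# die strasse ist wie mans kennt nord süd ausgerichtet
```

Model predictions should be compared against references normalized in the same way; otherwise casing and punctuation differences are counted as recognition errors.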

📄 License

This project is licensed under the Apache 2.0 license.

📋 Model information

Property	Details
Model type	wav2vec 2.0 XLS-R 1B + TEVR tokens + 5-gram LM by Hajo Nils Krabbenhöft
Training data	Common Voice
Evaluation metrics	Word error rate (WER), character error rate (CER)
Test WER	3.6433399042523233
Test CER	1.5398893560981173