wav2vec2-large-english: an open-source automatic speech recognition model


Wav2vec2 Large English

Developed by jonatasgrosman

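An automatic speech recognition model fine-tuned for English from facebook/wav2vec2-large and trained on the Common Voice 6.1 dataset.

Speech recognition

Transformers

English | Open Source License: Apache-2.0 | #English speech recognition #Low word error rate #General-purpose speech adaptation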

Downloads 355

Release Time : 3/2/2022

Model Overview

A wav2vec2 large model optimized for English speech recognition. It expects audio input sampled at 16 kHz.

Model Features

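High-performance English recognition

Achieves 21.53% WER and 9.66% CER on the Common Voice English test set.

Based on a large pretrained model

Fine-tuned from facebook/wav2vec2-large, which provides strong speech feature extraction.

16 kHz sampling rate support

Optimized for audio input sampled at 16 kHz.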

Model Capabilities

English speech recognition

Speech-to-text conversion

Automatic speech transcription

Use Cases

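Speech transcription

Automatic meeting transcription

Automatically converts recordings of English meetings into written records.

About 80% word accuracy, inferred from the WER reported on the Common Voice English test set

Podcast transcription

Automatically converts English podcast episodes into text.

Voice assistants

English voice command recognition

English voice command recognition for smart devices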

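🚀 wav2vec2 large model fine-tuned for English speech recognition

This model is facebook/wav2vec2-large fine-tuned for automatic speech recognition on the English train and validation splits of Common Voice 6.1. When using this model, make sure that your speech input is sampled at 16 kHz.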
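Since the model expects 16 kHz input, it can help to check a file's sampling rate before transcription. The following is a minimal sketch that uses only Python's standard `wave` module, so it handles uncompressed WAV files only. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of any library mentioned on this page.

```python
import wave

TARGET_RATE = 16_000  # sampling rate the model expects, in Hz


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_RATE):
    """Return True if the file's sampling rate differs from the target rate."""
    return wav_sample_rate(path) != target_rate
```

Files recorded at another rate can be resampled while loading, for example with `librosa.load(path, sr=16_000)`.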

This model was fine-tuned thanks to GPU credits provided by OVHcloud.

The script used for training can be found here: https://github.com/jonatasgrosman/wav2vec2-sprint

🚀 Quick start

The model can be used directly, without a language model, as follows.

💻 Usage examples

Basic usage

Using the HuggingSound library:

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-english")
audio_paths = ["/path/to/file.mp3", "/path/to/another_file.wav"]

transcriptions = model.transcribe(audio_paths)

Advanced usage

Writing your own inference script:

import torch
import librosa
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
SAMPLES = 10

test_dataset = load_dataset("common_voice", LANG_ID, split=f"test[:{SAMPLES}]")

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = batch["sentence"].upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
predicted_sentences = processor.batch_decode(predicted_ids)

for i, predicted_sentence in enumerate(predicted_sentences):
    print("-" * 100)
    print("Reference:", test_dataset[i]["sentence"])
    print("Prediction:", predicted_sentence)

Reference	Prediction
"SHE'LL BE ALL RIGHT."	SHELL BE ALL RIGHT
SIX	SIX
"ALL'S WELL THAT ENDS WELL."	ALLAS WELL THAT ENDS WELL
DO YOU MEAN IT?	W MEAN IT
THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE, BUT STILL CAUSES REGRESSIONS.	THE NEW PATCH IS LESS INVASIVE THAN THE OLD ONE BUT STILL CAUSES REGRESTION
HOW IS MOZILLA GOING TO HANDLE AMBIGUITIES LIKE QUEUE AND CUE?	HOW IS MOSILLA GOING TO BANDL AND BE WHIT IS LIKE QU AND QU
"I GUESS YOU MUST THINK I'M KINDA BATTY."	RUSTION AS HAME AK AN THE POT
NO ONE NEAR THE REMOTE MACHINE YOU COULD RING?	NO ONE NEAR THE REMOTE MACHINE YOU COULD RING
SAUCE FOR THE GOOSE IS SAUCE FOR THE GANDER.	SAUCE FOR THE GUCE IS SAUCE FOR THE GONDER
GROVES STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD.	GRAFS STARTED WRITING SONGS WHEN SHE WAS FOUR YEARS OLD
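The inference script takes the argmax over the logits at each frame and passes the resulting ids to `processor.batch_decode`. Conceptually, greedy CTC decoding collapses consecutive repeated ids, drops the blank token, and joins what remains. The sketch below illustrates that idea on a made-up sequence. The toy vocabulary, the blank token `<pad>`, and the word delimiter `|` are assumptions chosen for illustration; the real vocabulary ships with the model's tokenizer, and this is not the library's implementation.

```python
from itertools import groupby

# Hypothetical toy vocabulary; the real one ships with the model's tokenizer.
BLANK = "<pad>"       # assumed CTC blank token
WORD_DELIMITER = "|"  # assumed word separator token
VOCAB = [BLANK, WORD_DELIMITER, "S", "I", "X", "A", "L"]


def greedy_ctc_decode(frame_ids, vocab=VOCAB):
    """Collapse repeated ids, drop blanks, and map delimiters to spaces."""
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    tokens = [vocab[i] for i in collapsed if vocab[i] != BLANK]
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax ids for a made-up utterance: S S <pad> I I <pad> X
frame_ids = [2, 2, 0, 3, 3, 0, 4]
print(greedy_ctc_decode(frame_ids))  # prints SIX
```

A blank between two identical ids is what lets a doubled letter survive the collapsing step, so `[5, 6, 0, 6]` decodes to `ALL` rather than `AL`.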

🔧 Evaluation

The model can be evaluated on the English (en) test data of Common Voice as follows.

import torch
import re
import warnings  # used below to silence audio-loading warnings
import librosa
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

LANG_ID = "en"
MODEL_ID = "jonatasgrosman/wav2vec2-large-english"
DEVICE = "cuda"

CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", "；", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "～", "﹏", "，", "｛", "｝", "（", "）", "［", "］", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", "：", "！", "？", "♪", "؛", "/", "\\", "º", "−", "^", "ʻ", "ˆ"]

test_dataset = load_dataset("common_voice", LANG_ID, split="test")

wer = load_metric("wer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/wer.py
cer = load_metric("cer.py") # https://github.com/jonatasgrosman/wav2vec2-sprint/blob/main/cer.py

chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)
model.to(DEVICE)

# Preprocessing the datasets.
# We need to read the audio files as arrays
def speech_file_to_array_fn(batch):
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        speech_array, sampling_rate = librosa.load(batch["path"], sr=16_000)
    batch["speech"] = speech_array
    batch["sentence"] = re.sub(chars_to_ignore_regex, "", batch["sentence"]).upper()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Running inference.
# We transcribe each batch with greedy (argmax) CTC decoding
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to(DEVICE), attention_mask=inputs.attention_mask.to(DEVICE)).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

predictions = [x.upper() for x in result["pred_strings"]]
references = [x.upper() for x in result["sentence"]]

print(f"WER: {wer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
print(f"CER: {cer.compute(predictions=predictions, references=references, chunk_size=1000) * 100}")
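The evaluation script cleans each reference sentence before scoring: it removes every character in `CHARS_TO_IGNORE` and upper-cases the rest. The sketch below shows the same steps on two sentences, using only a small subset of the list for readability. The names `CHARS_TO_IGNORE_SUBSET` and `normalize` are introduced here for illustration only.

```python
import re

# A small subset of the CHARS_TO_IGNORE list, for illustration only.
CHARS_TO_IGNORE_SUBSET = [",", "?", ".", "!", ";", ":", '"']

# Build a character class that matches any single ignored character.
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE_SUBSET))}]"


def normalize(sentence):
    """Strip ignored punctuation and upper-case the sentence."""
    return re.sub(chars_to_ignore_regex, "", sentence).upper()


print(normalize('"She\'ll be all right."'))  # SHE'LL BE ALL RIGHT
print(normalize("Do you mean it?"))          # DO YOU MEAN IT
```

The straight apostrophe is not in the ignore list, so contractions such as SHE'LL keep it in the reference text.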

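Test results:

The table below shows the word error rate (WER) and character error rate (CER) of this model. I also ran the evaluation script above on other models on 2021-06-17. Note that the results in the table below may differ from those already reported, owing to the specifics of the other evaluation scripts used.

Model	WER	CER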
jonatasgrosman/wav2vec2-large-xlsr-53-english	18.98%	8.29%
jonatasgrosman/wav2vec2-large-english	21.53%	9.66%
facebook/wav2vec2-large-960h-lv60-self	22.03%	10.39%
facebook/wav2vec2-large-960h-lv60	23.97%	11.14%
boris/xlsr-en-punctuation	29.10%	10.75%
facebook/wav2vec2-large-960h	32.79%	16.03%
facebook/wav2vec2-base-960h	39.86%	19.89%
facebook/wav2vec2-base-100h	51.06%	25.06%
elgeish/wav2vec2-large-lv60-timit-asr	59.96%	34.28%
facebook/wav2vec2-base-10k-voxpopuli-ft-en	66.41%	36.76%
elgeish/wav2vec2-base-timit-asr	68.78%	36.81%
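For readers unfamiliar with the two metrics in the table, both are edit distances normalized by the length of the reference: WER counts word-level insertions, deletions, and substitutions, and CER counts the same at character level. The following is a minimal sketch of the textbook per-sentence definition, applied to one reference/prediction pair with punctuation removed. It does not reproduce the `wer.py` and `cer.py` scripts used for the evaluation, and the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "SHE'LL BE ALL RIGHT"
prediction = "SHELL BE ALL RIGHT"
print(f"WER: {word_error_rate(reference, prediction):.2%}")  # WER: 25.00%
print(f"CER: {char_error_rate(reference, prediction):.2%}")  # CER: 5.26%
```

One wrong word out of four gives a 25% WER, while the single missing apostrophe out of 19 reference characters gives a CER of about 5%, which is why CER is typically much lower than WER for the same output.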

📚 Citation

If you want to cite this model, you can use the following:

@misc{grosman2021wav2vec2-large-english,
  title={Fine-tuned wav2vec2 large model for speech recognition in {E}nglish},
  author={Grosman, Jonatas},
  howpublished={\url{https://huggingface.co/jonatasgrosman/wav2vec2-large-english}},
  year={2021}
}