wav2vec2-large-xlsr-53-es: an open-source speech recognition model for accurate Spanish transcription

Wav2vec2 Large Xlsr 53 Es

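Developed by pcuenq

A speech recognition model based on Facebook's wav2vec2-large-xlsr-53, fine-tuned on the Spanish Common Voice dataset. Test WER: 10.50%.

Speech recognition

Transformers

Spanish · Open source · License: Apache-2.0 · #SpanishSpeechRecognition #LowWER #XLSRFineTuning

Downloads: 147

Released: 3/2/2022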

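Model overview

This is an automatic speech recognition (ASR) model optimized for Spanish. It converts Spanish speech to text.

Model features

Low word error rate

Achieves a WER of 10.50% on the Common Voice Spanish test set.

Diacritics preserved

Keeps Spanish diacritics (accent marks), which preserves distinctions in meaning.

No language model required

Can be used directly, without an additional language model.

Multi-stage training

Trained in several stages, with the WER improving from about 16.3% to about 10.5% along the way.

Model capabilities

Spanish speech recognition

16 kHz audio processing

Batch speech-to-text conversion

Use cases

Speech transcription

Spanish speech to text

Converts spoken Spanish content to text.

WER of about 10.5% on the Common Voice Spanish test set

Voice assistants

Spanish voice command recognition

Can serve as the base recognition component of a Spanish voice assistant.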

🚀 Wav2Vec2-Large-XLSR-53-Spanish

This model is facebook/wav2vec2-large-xlsr-53 fine-tuned on Spanish using the Common Voice dataset. When using this model, make sure that your speech input is sampled at 16 kHz.
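As a quick way to act on the 16 kHz requirement, the sketch below reads the sampling rate from a WAV header with the Python standard library and reports whether resampling is needed. The helper names (`wav_sample_rate`, `needs_resampling`, `write_silent_wav`) are hypothetical and not part of the model card, and the check only covers PCM WAV files. For other formats, the sampling rate returned by `torchaudio.load` can be compared against 16,000 in the same way.

```python
import os
import tempfile
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model expects


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, target_rate=TARGET_SAMPLE_RATE):
    """True when the file must be resampled before it is passed to the model."""
    return wav_sample_rate(path) != target_rate


def write_silent_wav(path, sample_rate, n_frames=160):
    """Demo helper (hypothetical): write a short mono 16-bit silent WAV."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * n_frames)


# Example: a 48 kHz clip needs resampling, a 16 kHz clip does not.
demo_dir = tempfile.mkdtemp()
clip_48k = os.path.join(demo_dir, "clip_48k.wav")
clip_16k = os.path.join(demo_dir, "clip_16k.wav")
write_silent_wav(clip_48k, 48_000)
write_silent_wav(clip_16k, 16_000)
print(needs_resampling(clip_48k), needs_resampling(clip_16k))
```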

🚀 Quick Start

The model can be used directly, without a language model. For example:

import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "es", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")

# The resampler assumes 48 kHz source audio and converts it to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocess the dataset: read each audio file into a 16 kHz array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
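The last steps of the example above (`torch.argmax` followed by `processor.batch_decode`) amount to greedy CTC decoding: take the most likely label per audio frame, collapse consecutive repeats, drop blank labels, and join what remains. The following is a minimal pure-Python sketch of that idea. The toy vocabulary, the blank ID, and the `|` word delimiter are assumptions made for illustration, and the function name `ctc_greedy_decode` is hypothetical. The real vocabulary is loaded with the processor.

```python
from itertools import groupby

# Hypothetical toy vocabulary for illustration only.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "o", 4: "l", 5: "a"}
BLANK_ID = 0          # assumption: the padding token doubles as the CTC blank
WORD_DELIMITER = "|"  # assumption: symbol that marks word boundaries


def ctc_greedy_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame labels, drop blanks, and join into text."""
    # 1. Merge runs of identical labels into a single label.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Remove blanks; a blank between two equal labels keeps both of them.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # 3. Turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax output for a short clip (made-up values).
frames = [2, 2, 0, 3, 3, 4, 0, 5, 5, 1, 1]
print(ctc_greedy_decode(frames))
```

Because no language model rescoring is involved, the output depends only on the per-frame argmax, which is what "used directly, without a language model" means in practice.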


📚 Documentation

Evaluation

The model can be evaluated as follows on the Spanish test data of Common Voice.

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

test_dataset = load_dataset("common_voice", "es", split="test")
wer = load_metric("wer")

processor = Wav2Vec2Processor.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model = Wav2Vec2ForCTC.from_pretrained("pcuenq/wav2vec2-large-xlsr-53-es")
model.to("cuda")

## Text pre-processing

chars_to_ignore_regex = '[\,\¿\?\.\¡\!\-\;\:\"\“\%\‘\”\\…\’\ː\'\‹\›\`\´\®\—\→]'
chars_to_ignore_pattern = re.compile(chars_to_ignore_regex)

def remove_special_characters(batch):
    batch["sentence"] = chars_to_ignore_pattern.sub('', batch["sentence"]).lower() + " "
    return batch

def replace_diacritics(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ì', 'í', sentence)
    sentence = re.sub('ù', 'ú', sentence)
    sentence = re.sub('ò', 'ó', sentence)
    sentence = re.sub('à', 'á', sentence)
    batch["sentence"] = sentence
    return batch

def replace_additional(batch):
    sentence = batch["sentence"]
    sentence = re.sub('ã', 'a', sentence)   # Portuguese, as in São Paulo
    sentence = re.sub('ō', 'o', sentence)   # Japanese
    sentence = re.sub('ê', 'e', sentence)   # Português
    batch["sentence"] = sentence
    return batch

## Audio pre-processing

# I tried to perform the resampling using a `torchaudio` `Resampler` transform,
# but found that the process deadlocked when using multiple processes.
# Perhaps my torchaudio is using the wrong sox library under the hood, I'm not sure.
# Fortunately, `librosa` seems to work fine, so that's what I'll use for now.

import librosa
def speech_file_to_array_fn(batch):
    speech_array, sample_rate = torchaudio.load(batch["path"])
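    # Keyword arguments: recent librosa releases no longer accept the rates positionally.
    batch["speech"] = librosa.resample(speech_array.squeeze().numpy(), orig_sr=sample_rate, target_sr=16_000)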
    return batch

# One-pass mapping function

# Text transformation and audio resampling
def cv_prepare(batch):
    batch = remove_special_characters(batch)
    batch = replace_diacritics(batch)
    batch = replace_additional(batch)
    batch = speech_file_to_array_fn(batch)
    return batch

# Number of CPUs or None
num_proc = 16

test_dataset = test_dataset.map(cv_prepare, remove_columns=['path'], num_proc=num_proc)

def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)

    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)

# WER Metric computation
# `wer.compute` crashes in my computer with more than ~10000 samples.
# Until I confirm in a different one, I created a "chunked" version of the computation.
# It gives the same results as `wer.compute` for smaller datasets.

import jiwer

def chunked_wer(targets, predictions, chunk_size=None):
    # Without a chunk size, score the whole set in a single call.
    if chunk_size is None:
        return jiwer.wer(targets, predictions)
    start = 0
    end = chunk_size
    # Running totals of hits, substitutions, deletions and insertions.
    H, S, D, I = 0, 0, 0, 0
    while start < len(targets):
        chunk_metrics = jiwer.compute_measures(targets[start:end], predictions[start:end])
        H = H + chunk_metrics["hits"]
        S = S + chunk_metrics["substitutions"]
        D = D + chunk_metrics["deletions"]
        I = I + chunk_metrics["insertions"]
        start += chunk_size
        end += chunk_size
    # WER = errors / reference words, where reference words = H + S + D.
    return float(S + D + I) / float(H + S + D)

print("WER: {:.2f}".format(100 * chunked_wer(result["sentence"], result["pred_strings"], chunk_size=4000)))
#print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))

Test result: 10.50%
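The chunked computation in the evaluation script works because corpus-level WER is a ratio of totals: errors (S + D + I) and reference words (H + S + D) can be summed chunk by chunk and divided once at the end. The sketch below illustrates this with a small pure-Python word-level edit distance instead of `jiwer`. The function names are hypothetical, and this is an illustration of the principle under those assumptions, not the author's implementation. It also shows that averaging per-sentence WERs gives a different number.

```python
def word_edit_distance(reference, hypothesis):
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    ref, hyp = reference.split(), hypothesis.split()
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]


def corpus_wer(references, hypotheses, chunk_size=None):
    """Total errors / total reference words, accumulated chunk by chunk."""
    chunk_size = chunk_size or max(len(references), 1)
    errors = ref_words = 0
    for start in range(0, len(references), chunk_size):
        chunk = zip(references[start:start + chunk_size],
                    hypotheses[start:start + chunk_size])
        for ref, hyp in chunk:
            errors += word_edit_distance(ref, hyp)
            ref_words += len(ref.split())
    return errors / ref_words  # assumes at least one reference word


def mean_sentence_wer(references, hypotheses):
    """Average of per-sentence WERs; shown only for contrast."""
    rates = [word_edit_distance(r, h) / len(r.split())
             for r, h in zip(references, hypotheses)]
    return sum(rates) / len(rates)


refs = ["hola", "el gato come pescado fresco"]
hyps = ["ola", "el gato come pescado fresco"]
print(corpus_wer(refs, hyps), corpus_wer(refs, hyps, chunk_size=1))
print(mean_sentence_wer(refs, hyps))
```

Only a few running totals are kept between chunks, so memory use does not grow with the size of the test set.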

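Text processing

The Common Voice es dataset contains many characters that do not belong to the Spanish language, even after separators and punctuation are discarded. I applied a few transformations and discarded most of the unwanted characters.

I decided to keep all Spanish diacritics (accent marks). This was a difficult decision. Sometimes an accent mark is required only by orthographic rules and does not change the meaning of the word. In other cases, however, the accent carries meaning and distinguishes between different senses. Using only unaccented characters would certainly give a better WER score, and the resulting text would still be understood by Spanish speakers, but I think keeping them is "more correct".

All the rules I applied are shown in the evaluation script.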
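To make the trade-off concrete, the sketch below contrasts the approach described above (remove punctuation, map stray grave accents to acute ones, keep Spanish diacritics) with the alternative of stripping accents entirely. The punctuation set is a shortened illustration, not the full list from the evaluation script, and the function names are hypothetical.

```python
import re
import unicodedata

# Illustrative subset of the punctuation removed by the evaluation script.
PUNCTUATION = re.compile(r'[,¿?.¡!;:"“”…]')
# Grave accents are not used in Spanish; map them to the acute forms,
# as `replace_diacritics` does in the evaluation script.
GRAVE_TO_ACUTE = str.maketrans("ìùòà", "íúóá")


def normalize_keep_diacritics(sentence):
    """Drop punctuation, fix grave accents, lowercase; keep Spanish accents."""
    return PUNCTUATION.sub("", sentence).translate(GRAVE_TO_ACUTE).lower()


def strip_diacritics(sentence):
    """Alternative that was decided against: remove all combining marks.

    Note that this also turns "ñ" into "n".
    """
    decomposed = unicodedata.normalize("NFD", sentence)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


kept = normalize_keep_diacritics("¿Dónde está él?")
print(kept)                    # accents preserved: "él" (he) stays distinct
print(strip_diacritics(kept))  # accents removed: "él" collapses into "el" (the)
```

With accents stripped, a prediction of "el" for a reference "él" no longer counts as an error, which is why the unaccented variant would score a lower WER.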

Training

The Common Voice train and validation datasets were used for training.

For dataset-handling reasons, I initially split train + validation into 10% splits, so that I could see progress early and react if needed.
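The text does not say how the 10% splits were produced. One simple possibility, shown here purely as an assumption, is to cut the example indices into ten contiguous, near-equal ranges. The function name `contiguous_splits` is hypothetical. Each `(start, end)` pair could then be used to select a subset, for instance with `Dataset.select(range(start, end))` in the Datasets library.

```python
def contiguous_splits(num_examples, num_splits=10):
    """Return (start, end) index ranges that cover range(num_examples).

    The first `num_examples % num_splits` ranges get one extra example,
    so sizes differ by at most one.
    """
    base, extra = divmod(num_examples, num_splits)
    ranges, start = [], 0
    for split_index in range(num_splits):
        end = start + base + (1 if split_index < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


# Example: 25 examples in 10 splits gives five splits of 3 and five of 2.
for start, end in contiguous_splits(25):
    print(start, end)
```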

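I trained for 30 epochs on the first split only, using values similar to those Patrick suggested in the demo notebook, with a batch size of 24 and 2 gradient accumulation steps. This gave a WER of about 16.3% on the full test set.
I then trained the resulting model for 3 epochs on each of the remaining 9 splits, with a faster warmup of 75 steps.
Next, I trained for 3 epochs on each of the 10 splits with the learning rate set to 1e-4, again with 75 warmup steps. The resulting model had a WER of about 11.7%.
By this point I had found the cause of the initial delay in training time, so I decided to train on the full dataset. Since my tests had shown that varying the learning rate helped, I wanted to replicate that: I chose a cosine schedule with hard restarts, a reference learning rate of 3e-5, and 10 epochs. I also set the cosine schedule to 10 cycles and used no warmup. This gave a WER of about 10.5%.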
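For the final stage, a cosine schedule with hard restarts can be sketched as below. With 10 cycles over 10 epochs, the learning rate restarts at its reference value once per epoch and decays toward zero within each epoch. The formula follows the commonly used definition (the same shape as `get_cosine_with_hard_restarts_schedule_with_warmup` in transformers, as far as I understand it). The text does not state which implementation was used, so treat the function name and the step counts in the example as assumptions.

```python
import math


def cosine_hard_restarts_lr(step, total_steps, base_lr=3e-5,
                            num_cycles=10, warmup_steps=0):
    """Learning rate at `step` for a cosine schedule with hard restarts."""
    if step < warmup_steps:
        # Linear warmup (unused here, since warmup_steps defaults to 0).
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle, in [0, 1).
    cycle_position = (num_cycles * progress) % 1.0
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_position))


# Hypothetical run of 1000 optimizer steps: each cycle spans 100 steps.
for step in (0, 50, 99, 100, 1000):
    print(step, cosine_hard_restarts_lr(step, total_steps=1000))
```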

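Other things I tried

Starting from the same fine-tuned model, I compared a fixed learning rate of 1e-4 with a linear schedule with warmup. The linear schedule worked better (WER 11.85% vs. 12.72%).
I tried to use the Spanish model to improve a Basque model. I transformed the text to make the orthography closer to the target language, but the Basque model did not improve.
Label smoothing did not help.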
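For the first comparison above, the two schedules differ as follows: a fixed schedule returns the same learning rate at every step, while a linear schedule with warmup ramps up from zero and then decays linearly back to zero. The sketch below assumes a peak rate of 1e-4 and 75 warmup steps, borrowed from the training description. The text does not give the exact settings used in the comparison, and the function names are hypothetical.

```python
def constant_lr(step, base_lr=1e-4):
    """Fixed learning rate: identical at every step."""
    return base_lr


def linear_warmup_decay_lr(step, total_steps, base_lr=1e-4, warmup_steps=75):
    """Linear warmup to `base_lr`, then linear decay to zero at `total_steps`."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Hypothetical run of 1075 steps: 75 warmup steps, then 1000 decay steps.
for step in (0, 75, 575, 1075):
    print(step, constant_lr(step), linear_warmup_decay_lr(step, total_steps=1075))
```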

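Issues and other technical challenges

I had previously used the transformers library as an end user, trying Bert on a few tasks, but this was the first time I needed to look into the code.

The Datasets abstraction is great because, being based on memory-mapped files, it can handle datasets of any size. It is important, however, to understand its limitations and trade-offs. Caching is convenient, but disk usage grows quickly. I keep the datasets for my current projects on a 1 TB fast SSD, and I ran out of disk space several times. I had to understand how cache files are stored and learn when it is best to disable caching and save manually. Data exploration is best done on small or sampled datasets, whereas real processing is most efficient once the required transformations have been identified and applied in a single map operation.
There was a noticeable delay before training started. Fortunately, I found the cause, discussed it on Slack and the forums, and created a workaround.
The WER metric crashed on large datasets. I evaluated on a small sample (which is also faster) and wrote a cumulative version of WER that runs in fixed memory. I would like to verify whether this change is suitable for use inside the training loop.
torchaudio deadlocks when multiple processes are used, while librosa works fine. This needs investigation.
With num_proc, the progress bars were not displayed inside the notebook. This is surely a permissions issue on my computer, and I still need to find the cause.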