WavLM - BARTオープンソース自動音声認識モデル - 英語テキスト出力、タイムスタンプ标注、複数話者のセグメント化を無料で実現

ホーム

Wavlm Bart

nguyenvulebinhによって開発

英語の自動音声認識(ASR)をサポートするシーケンス・ツー・シーケンスモデルで、正規化されたテキスト、タイムスタンプ注釈、複数話者セグメンテーションを出力可能です。

音声認識

Transformers

英語#英語音声認識 #タイムスタンプ注釈 #複数話者セグメンテーション

ダウンロード数 24

リリース時間 : 5/23/2023

モデル概要

このモデルはwav2vec2とbartphoアーキテクチャに基づいており、主に英語音声認識タスクに使用され、タイムスタンプ付きテキストと複数話者セグメンテーションの出力をサポートします。

モデル特徴

タイムスタンプ注釈

認識されたテキストに正確なタイムスタンプを注釈可能

複数話者セグメンテーション

異なる話者の音声を認識・セグメント化可能

正規化テキスト出力

正規化されたテキスト結果を出力

モデル能力

英語音声認識

タイムスタンプ注釈

複数話者セグメンテーション

使用事例

音声文字起こし

会議議事録

会議録音をタイムスタンプ付きテキスト記録に変換

発言内容を正確に認識し発言時間点を注釈

インタビュー文字起こし

インタビュー録音を転記し異なる話者を区別

異なるインタビュー対象者の発言を自動セグメント化

🚀 英語音声認識シーケンス-to-シーケンスモデル

このモデルは、正規化されたテキストの出力、タイムスタンプのラベリング、および複数話者のセグメンテーションをサポートしています。

🚀 クイックスタート

この英語音声認識シーケンス-to-シーケンスモデルを使用することで、音声データから正規化されたテキストを出力し、タイムスタンプをラベリングし、複数話者をセグメント化することができます。

💻 使用例

基本的な使用法

# !pip install transformers sentencepiece

from transformers import SpeechEncoderDecoderModel
from transformers import AutoFeatureExtractor, AutoTokenizer, GenerationConfig
import torchaudio
import torch

model_path = 'nguyenvulebinh/wav2vec2-bartpho'
model = SpeechEncoderDecoderModel.from_pretrained(model_path).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if torch.cuda.is_available():
  model = model.cuda()


def decode_tokens(token_ids, skip_special_tokens=True, time_precision=0.02):
    timestamp_begin = tokenizer.vocab_size
    outputs = [[]]
    for token in token_ids:
        if token >= timestamp_begin:
            timestamp = f" |{(token - timestamp_begin) * time_precision:.2f}| "
            outputs.append(timestamp)
            outputs.append([])
        else:
            outputs[-1].append(token)
    outputs = [
        s if isinstance(s, str) else tokenizer.decode(s, skip_special_tokens=skip_special_tokens) for s in outputs
    ]
    return "".join(outputs).replace("< |", "<|").replace("| >", "|>")

def decode_wav(audio_wavs, asr_model, prefix=""):
  device = next(asr_model.parameters()).device
  input_values = feature_extractor.pad(
    [{"input_values": feature} for feature in audio_wavs],
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    return_tensors="pt",
  )

  output_beam_ids = asr_model.generate(
    input_values['input_values'].to(device), 
    attention_mask=input_values['attention_mask'].to(device),
    decoder_input_ids=tokenizer.batch_encode_plus([prefix] * len(audio_wavs), return_tensors="pt")['input_ids'][..., :-1].to(device),
    generation_config=GenerationConfig(decoder_start_token_id=tokenizer.bos_token_id),
    max_length=250, 
    num_beams=25, 
    no_repeat_ngram_size=4, 
    num_return_sequences=1, 
    early_stopping=True,
    return_dict_in_generate=True,
    output_scores=True,
  )

  output_text = [decode_tokens(sequence) for sequence in output_beam_ids.sequences]

  return output_text


# https://huggingface.co/nguyenvulebinh/wavlm-bart/resolve/main/sample.wav
print(decode_wav([torchaudio.load('sample.wav')[0].squeeze()], model))

# <|0.06| What are the many parts that make a machine learning system feel like it works so magically cheap? |5.86|>
# <|5.68| Explletability factors important, so they tend to gear towards more simpler models with less parameters, but easier to explain, and on the other spectrum there are |15.86|>

📚 ドキュメント

引用

このリポジトリは、以下の論文のアイデアを使用しています。このモデルを使用して公開された結果を生成する場合、または他のソフトウェアに組み込む場合は、論文を引用してください。

@INPROCEEDINGS{10446589,
  author={Nguyen, Thai-Binh and Waibel, Alexander},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Synthetic Conversations Improve Multi-Talker ASR}, 
  year={2024},
  volume={},
  number={},
  pages={10461-10465},
  keywords={Systematics;Error analysis;Knowledge based systems;Oral communication;Signal processing;Data models;Acoustics;multi-talker;asr;synthetic conversation},
  doi={10.1109/ICASSP48485.2024.10446589}
}