The open-source wav2vec2-bartpho model supports Vietnamese automatic speech recognition with normalized text output.

Wav2vec2 Bartpho

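Developed by nguyenvulebinh

This is an automatic speech recognition model for Vietnamese that can output normalized text, attach timestamps, and segment multiple speakers.

Speech Recognition

Transformers

Other #Vietnamese speech recognition #timestamping #multi-speaker segmentation

Downloads: 472

Release date: 10/5/2023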

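Model Overview

This model is based on the wav2vec2 and BARTpho architectures and is designed specifically for Vietnamese automatic speech recognition. It supports timestamped text output and multi-speaker segmentation.

Model Features

Timestamping

Attaches timestamps to the recognized text (the usage example decodes them at a resolution of 0.02 seconds)

Multi-speaker segmentation

Recognizes and segments the speech of different speakers

Text normalization

Outputs the recognized text in normalized form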
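
To illustrate the timestamping feature, here is a minimal sketch of the arithmetic that the `decode_tokens` function in the usage example applies: any token id at or above `tokenizer.vocab_size` is treated as a timestamp, and its offset is multiplied by `time_precision` (0.02 seconds by default). The helper name `timestamp_token_to_seconds` and the vocabulary size of 1000 are assumptions made for illustration only and are not part of the model's API.

```python
def timestamp_token_to_seconds(token_id: int, vocab_size: int,
                               time_precision: float = 0.02) -> float:
    """Map a timestamp token id to seconds (hypothetical helper).

    Mirrors the expression (token - timestamp_begin) * time_precision
    used by decode_tokens in the usage example.
    """
    if token_id < vocab_size:
        raise ValueError("token_id is a regular text token, not a timestamp token")
    return round((token_id - vocab_size) * time_precision, 2)


# Hypothetical vocabulary size, chosen only for this illustration.
EXAMPLE_VOCAB_SIZE = 1000

# The 350th timestamp token corresponds to 350 * 0.02 seconds.
print(timestamp_token_to_seconds(EXAMPLE_VOCAB_SIZE + 350, EXAMPLE_VOCAB_SIZE))
```

With the real model, `vocab_size` would come from `tokenizer.vocab_size`, as in the usage example.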

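Model Capabilities

Vietnamese speech recognition

Timestamping

Multi-speaker segmentation

Normalized text output

Use Cases

Speech Transcription

News transcription

Converts Vietnamese news broadcasts into timestamped text

The output includes time markers and segmentation

Meeting Records

Multi-speaker meeting records

Automatically recognizes and segments the speech of different speakers in a meeting

Can distinguish between speakers and mark when each one speaks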
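
As one possible way to use timestamped transcripts in the scenarios above, the sketch below formats `(start, end, text)` segments as SubRip (SRT) subtitle entries. This is an illustrative assumption about a downstream step, not something the model provides: the function names `to_srt_time` and `segments_to_srt` are hypothetical, and the segments are assumed to have already been extracted from the model's decoded output.

```python
from typing import Iterable, Tuple


def to_srt_time(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS,mmm (SRT timestamp style)."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Tuple[float, float, str]]) -> str:
    """Render (start, end, text) tuples as numbered SRT blocks."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}")
    return "\n\n".join(blocks) + "\n"


# Hypothetical segments; the times match the sample output in the usage example.
example_segments = [
    (0.00, 7.00, "Gia đình cho biết ..."),
    (8.14, 19.02, "Không ai giúp đỡ được mình ..."),
]
print(segments_to_srt(example_segments))
```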

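🚀 Vietnamese Automatic Speech Recognition (ASR) Sequence-to-Sequence Model

This model supports normalized text output, timestamp labeling, and multi-speaker segmentation.

🚀 Quick Start

You can get started with this model using the code below.

💻 Usage Example

Basic Usage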

# !pip install transformers sentencepiece torch torchaudio

from transformers import SpeechEncoderDecoderModel
from transformers import AutoFeatureExtractor, AutoTokenizer, GenerationConfig
import torchaudio
import torch

model_path = 'nguyenvulebinh/wav2vec2-bartpho'
model = SpeechEncoderDecoderModel.from_pretrained(model_path).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if torch.cuda.is_available():
  model = model.cuda()


def decode_tokens(token_ids, skip_special_tokens=True, time_precision=0.02):
    timestamp_begin = tokenizer.vocab_size
    outputs = [[]]
    for token in token_ids:
        if token >= timestamp_begin:
            timestamp = f" |{(token - timestamp_begin) * time_precision:.2f}| "
            outputs.append(timestamp)
            outputs.append([])
        else:
            outputs[-1].append(token)
    outputs = [
        s if isinstance(s, str) else tokenizer.decode(s, skip_special_tokens=skip_special_tokens) for s in outputs
    ]
    return "".join(outputs).replace("< |", "<|").replace("| >", "|>")

def decode_wav(audio_wavs, asr_model, prefix=""):
    """Transcribe a batch of 1-D waveform tensors; returns one decoded string per input."""
    device = next(asr_model.parameters()).device

    # Pad the waveforms to a common length and build the attention mask.
    input_values = feature_extractor.pad(
        [{"input_values": feature} for feature in audio_wavs],
        padding=True,
        max_length=None,
        pad_to_multiple_of=None,
        return_tensors="pt",
    )

    # Beam search; the decoder prompt is the encoded prefix without its final token.
    output_beam_ids = asr_model.generate(
        input_values['input_values'].to(device),
        attention_mask=input_values['attention_mask'].to(device),
        decoder_input_ids=tokenizer.batch_encode_plus([prefix] * len(audio_wavs), return_tensors="pt")['input_ids'][..., :-1].to(device),
        generation_config=GenerationConfig(decoder_start_token_id=tokenizer.bos_token_id),
        max_length=250,
        num_beams=25,
        no_repeat_ngram_size=4,
        num_return_sequences=1,
        early_stopping=True,
        return_dict_in_generate=True,
        output_scores=True,
    )

    # Convert token ids (text and timestamp tokens) back into a string.
    output_text = [decode_tokens(sequence) for sequence in output_beam_ids.sequences]

    return output_text


# https://huggingface.co/nguyenvulebinh/wav2vec2-bartpho/resolve/main/sample_news.wav
print(decode_wav([torchaudio.load('sample_news.wav')[0].squeeze()], model))

# <|0.00| Gia đình cho biết, nhiều lần đã từng gọi điện báo chính quyền và lực lượng an ninh địa phương nhưng đều không có tác dụng |7.00|>
# <|8.14| Không ai giúp đỡ được mình một chút nào cả, nên là lúc đó là lúc tuyệt vọng nhất, nó tra tấn mình cực kỳ khổ, gây cái tâm lý ức chế rất là nhiều, rất là lớn |19.02|>
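
The sample output above wraps each utterance as `<|start| text |end|>`. As a minimal sketch, assuming the decoded string always follows that pattern, the segments can be split into start time, end time, and text with a regular expression. The names `Segment`, `SEGMENT_PATTERN`, and `parse_segments` are hypothetical helpers, not part of the model or of Transformers, and the sample string is a shortened version of the output shown above.

```python
import re
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    text: str     # transcribed text without the markers


# Matches "<|0.00| some text |7.00|>" and captures start, text, end.
SEGMENT_PATTERN = re.compile(
    r"<\|(\d+(?:\.\d+)?)\|\s*(.*?)\s*\|(\d+(?:\.\d+)?)\|>", re.DOTALL
)


def parse_segments(decoded: str) -> List[Segment]:
    """Split a decoded transcript into timestamped segments."""
    return [
        Segment(float(start), float(end), text)
        for start, text, end in SEGMENT_PATTERN.findall(decoded)
    ]


# Shortened sample in the same format as the output above.
sample = "<|0.00| Gia đình cho biết |7.00|> <|8.14| Không ai giúp đỡ được mình |19.02|>"
for segment in parse_segments(sample):
    print(f"{segment.start:.2f}-{segment.end:.2f}: {segment.text}")
```

Text that does not match the pattern is silently skipped, so a stricter parser may be preferable if malformed output needs to be detected.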

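📄 License

This model is released under the CC BY-NC 4.0 license.

📚 Citation

This repository uses ideas from the paper below. If you use this model to produce published results or incorporate it into other software, please cite the paper.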

@INPROCEEDINGS{10446589,
  author={Nguyen, Thai-Binh and Waibel, Alexander},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Synthetic Conversations Improve Multi-Talker ASR}, 
  year={2024},
  volume={},
  number={},
  pages={10461-10465},
  keywords={Systematics;Error analysis;Knowledge based systems;Oral communication;Signal processing;Data models;Acoustics;multi-talker;asr;synthetic conversation},
  doi={10.1109/ICASSP48485.2024.10446589}
}