Open-source wav2vec2-bartpho model supporting Vietnamese automatic speech recognition and text normalization tagging

Wav2vec2 Bartpho

Developed by nguyenvulebinh

This is an automatic speech recognition model for Vietnamese that outputs normalized text, labels timestamps, and segments speech from multiple speakers.

Speech Recognition

Transformers

Tags: Vietnamese speech recognition, Timestamp labeling, Multi-speaker segmentation


Release Time : 10/5/2023

Model Overview

This model combines the wav2vec2 and BARTpho architectures and is designed for Vietnamese automatic speech recognition. It supports timestamped text output and multi-speaker segmentation.

Model Features

Timestamp Labeling

Marks start and end timestamps for each recognized segment

Multi-speaker Segmentation

Supports identification and segmentation of speech from different speakers

Text Normalization

Outputs normalized recognized text
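
As a rough illustration of the timestamp labeling feature, the sketch below mirrors the arithmetic used by `decode_tokens` in the usage example on this page: token ids at or above the text vocabulary size are treated as timestamps, and each step above that boundary adds `time_precision` seconds. The function name `timestamp_token_to_seconds` and the `VOCAB_SIZE` value are hypothetical and chosen only for illustration; the real boundary is `tokenizer.vocab_size`.

```python
def timestamp_token_to_seconds(token_id, vocab_size, time_precision=0.02):
    """Return the time in seconds for a timestamp token, or None for a text token."""
    if token_id < vocab_size:
        return None  # ordinary text token, not a timestamp
    return (token_id - vocab_size) * time_precision


# Hypothetical vocabulary size, for illustration only.
VOCAB_SIZE = 40000

for token_id in (123, VOCAB_SIZE, VOCAB_SIZE + 350):
    seconds = timestamp_token_to_seconds(token_id, VOCAB_SIZE)
    label = "text token" if seconds is None else f"{seconds:.2f} s"
    print(token_id, "->", label)
```

With the default `time_precision` of 0.02, timestamps fall on a 20 ms grid.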

Model Capabilities

Vietnamese speech recognition

Timestamp labeling

Multi-speaker segmentation

Text normalization output

Use Cases

Speech Transcription

News Transcription

Transcribing Vietnamese news broadcasts into timestamped text

Sample output includes precise time markers and segmentation

Meeting Minutes

Multi-speaker Meeting Minutes

Automatically identifying and segmenting speech from different speakers in meetings

Can distinguish between different speakers and mark speaking times
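
For the news transcription use case, timestamped segments could be turned into subtitles. The sketch below is one possible approach, not something the model or its repository provides: it assumes the segments are already available as `(start_seconds, end_seconds, text)` tuples and renders them in the SRT subtitle format. The helper names `to_srt_time` and `segments_to_srt` are hypothetical.

```python
def to_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render (start_seconds, end_seconds, text) tuples as SRT subtitle text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(f"{index}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{text}\n")
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


# Illustrative segments with made-up placeholder text.
segments = [(0.0, 7.0, "first segment"), (8.14, 19.02, "second segment")]
print(segments_to_srt(segments))
```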

🚀 Vietnamese ASR Sequence-to-Sequence Model

This is a Vietnamese automatic speech recognition (ASR) sequence-to-sequence model. It outputs normalized text, labels timestamps, and segments speech from multiple speakers.

🚀 Quick Start

Installation

Install the required libraries. The usage example also imports torch and torchaudio, so they are included here:

pip install transformers sentencepiece torch torchaudio

Usage Examples

Basic Usage

from transformers import SpeechEncoderDecoderModel
from transformers import AutoFeatureExtractor, AutoTokenizer, GenerationConfig
import torchaudio
import torch

model_path = 'nguyenvulebinh/wav2vec2-bartpho'
model = SpeechEncoderDecoderModel.from_pretrained(model_path).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Move the model to the GPU when one is available.
if torch.cuda.is_available():
    model = model.cuda()


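def decode_tokens(token_ids, skip_special_tokens=True, time_precision=0.02):
    # Ids at or above the text vocabulary size are timestamp tokens;
    # each step above that boundary adds time_precision seconds.
    timestamp_begin = tokenizer.vocab_size
    outputs = [[]]
    for token in token_ids:
        token = int(token)  # accept tensor elements as well as plain ints
        if token >= timestamp_begin:
            timestamp = f" |{(token - timestamp_begin) * time_precision:.2f}| "
            outputs.append(timestamp)
            outputs.append([])
        else:
            outputs[-1].append(token)
    # Decode each run of text tokens; timestamp strings pass through unchanged.
    outputs = [
        s if isinstance(s, str) else tokenizer.decode(s, skip_special_tokens=skip_special_tokens) for s in outputs
    ]
    return "".join(outputs).replace("< |", "<|").replace("| >", "|>")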

def decode_wav(audio_wavs, asr_model, prefix=""):
    # Run on whichever device the model lives on.
    device = next(asr_model.parameters()).device
    # Pad the batch of 1-D waveforms to a common length.
    input_values = feature_extractor.pad(
        [{"input_values": feature} for feature in audio_wavs],
        padding=True,
        max_length=None,
        pad_to_multiple_of=None,
        return_tensors="pt",
    )

    # Encode the optional text prefix for the decoder and drop the final token the tokenizer appends.
    decoder_input_ids = tokenizer(
        [prefix] * len(audio_wavs), return_tensors="pt"
    )["input_ids"][..., :-1].to(device)

    output_beam_ids = asr_model.generate(
        input_values["input_values"].to(device),
        attention_mask=input_values["attention_mask"].to(device),
        decoder_input_ids=decoder_input_ids,
        generation_config=GenerationConfig(decoder_start_token_id=tokenizer.bos_token_id),
        max_length=250,
        num_beams=25,
        no_repeat_ngram_size=4,
        num_return_sequences=1,
        early_stopping=True,
        return_dict_in_generate=True,
        output_scores=True,
    )

    # Convert each generated id sequence to text with inline timestamps.
    return [decode_tokens(sequence) for sequence in output_beam_ids.sequences]


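# Sample audio: https://huggingface.co/nguyenvulebinh/wav2vec2-bartpho/resolve/main/sample_news.wav
# torchaudio.load returns (waveform, sample_rate); squeeze() drops the channel dimension of a mono file.
waveform, _ = torchaudio.load('sample_news.wav')
print(decode_wav([waveform.squeeze()], model))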

# <|0.00| Gia đình cho biết, nhiều lần đã từng gọi điện báo chính quyền và lực lượng an ninh địa phương nhưng đều không có tác dụng |7.00|>
# <|8.14| Không ai giúp đỡ được mình một chút nào cả, nên là lúc đó là lúc tuyệt vọng nhất, nó tra tấn mình cực kỳ khổ, gây cái tâm lý ức chế rất là nhiều, rất là lớn |19.02|>
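
The sample output above wraps each segment as `<|start| text |end|>`. For downstream processing it can help to split that string into structured records. The following is a minimal sketch assuming the output always follows that pattern; `parse_segments` and `SEGMENT_PATTERN` are hypothetical names, not part of the model's code, and the sample string is a shortened version of the output shown above.

```python
import re

# Matches one segment of the form "<|start| text |end|>".
SEGMENT_PATTERN = re.compile(
    r"<\|(\d+(?:\.\d+)?)\|\s*(.*?)\s*\|(\d+(?:\.\d+)?)\|>", re.DOTALL
)


def parse_segments(decoded):
    """Split a decoded transcript into (start_seconds, end_seconds, text) tuples."""
    return [
        (float(start), float(end), text)
        for start, text, end in SEGMENT_PATTERN.findall(decoded)
    ]


sample = "<|0.00| Gia đình cho biết |7.00|> <|8.14| Không ai giúp đỡ được mình |19.02|>"
for start, end, text in parse_segments(sample):
    print(f"{start:6.2f} -> {end:6.2f}  {text}")
```

Text that does not match the pattern is silently skipped, so a stricter caller may want to compare the number of parsed segments against the number of `<|` markers in the string.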

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📚 Citation

This model builds on the idea from the following paper. Please cite the paper if you use this model to help produce published results or incorporate it into other software.

@INPROCEEDINGS{10446589,
  author={Nguyen, Thai-Binh and Waibel, Alexander},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Synthetic Conversations Improve Multi-Talker ASR}, 
  year={2024},
  volume={},
  number={},
  pages={10461-10465},
  keywords={Systematics;Error analysis;Knowledge based systems;Oral communication;Signal processing;Data models;Acoustics;multi-talker;asr;synthetic conversation},
  doi={10.1109/ICASSP48485.2024.10446589}
}

📞 Contact

If you have any questions, please contact nguyenvulebinh@gmail.com.
