WavLM - BART Open-source Automatic Speech Recognition Model - Free Implementation of English Text Output, Timestamp Annotation, and Multi-speaker Segmentation

Wavlm Bart

Developed by nguyenvulebinh

A sequence-to-sequence model supporting English automatic speech recognition (ASR), capable of outputting normalized text, timestamp annotations, and multi-speaker segmentation.

Speech Recognition

Transformers

English#English speech recognition #Timestamp annotation #Multi-speaker segmentation

Downloads 24

Release Time : 5/23/2023

Model Overview

This model is based on the wav2vec2 and bartpho architectures, primarily designed for English speech recognition tasks, supporting output with timestamped text and multi-speaker segmentation.

Model Features

Timestamp annotation

Capable of annotating recognized text with precise timestamps.

Multi-speaker segmentation

Supports recognizing and segmenting speech from different speakers.

Normalized text output

Outputs normalized text results.

Model Capabilities

English speech recognition

Timestamp annotation

Multi-speaker segmentation

Use Cases

Speech transcription

Meeting minutes

Convert meeting recordings into timestamped text records.

Accurately recognizes speech content and annotates speaking time points.

Interview transcription

Transcribe interview recordings and distinguish between different speakers.

Automatically segments speeches from different interviewees.

🚀 English ASR Sequence-to-Sequence Model

This is an English ASR sequence-to-sequence model. It supports output normalizing text, labeling timestamps, and segmenting multiple speakers.

🚀 Quick Start

Installation

First, you need to install the necessary libraries:

# !pip install transformers sentencepiece

Basic Usage

from transformers import SpeechEncoderDecoderModel
from transformers import AutoFeatureExtractor, AutoTokenizer, GenerationConfig
import torchaudio
import torch

model_path = 'nguyenvulebinh/wav2vec2-bartpho'
model = SpeechEncoderDecoderModel.from_pretrained(model_path).eval()
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
if torch.cuda.is_available():
  model = model.cuda()


def decode_tokens(token_ids, skip_special_tokens=True, time_precision=0.02):
    timestamp_begin = tokenizer.vocab_size
    outputs = [[]]
    for token in token_ids:
        if token >= timestamp_begin:
            timestamp = f" |{(token - timestamp_begin) * time_precision:.2f}| "
            outputs.append(timestamp)
            outputs.append([])
        else:
            outputs[-1].append(token)
    outputs = [
        s if isinstance(s, str) else tokenizer.decode(s, skip_special_tokens=skip_special_tokens) for s in outputs
    ]
    return "".join(outputs).replace("< |", "<|").replace("| >", "|>")

def decode_wav(audio_wavs, asr_model, prefix=""):
  device = next(asr_model.parameters()).device
  input_values = feature_extractor.pad(
    [{"input_values": feature} for feature in audio_wavs],
    padding=True,
    max_length=None,
    pad_to_multiple_of=None,
    return_tensors="pt",
  )

  output_beam_ids = asr_model.generate(
    input_values['input_values'].to(device), 
    attention_mask=input_values['attention_mask'].to(device),
    decoder_input_ids=tokenizer.batch_encode_plus([prefix] * len(audio_wavs), return_tensors="pt")['input_ids'][..., :-1].to(device),
    generation_config=GenerationConfig(decoder_start_token_id=tokenizer.bos_token_id),
    max_length=250, 
    num_beams=25, 
    no_repeat_ngram_size=4, 
    num_return_sequences=1, 
    early_stopping=True,
    return_dict_in_generate=True,
    output_scores=True,
  )

  output_text = [decode_tokens(sequence) for sequence in output_beam_ids.sequences]

  return output_text


# https://huggingface.co/nguyenvulebinh/wavlm-bart/resolve/main/sample.wav
print(decode_wav([torchaudio.load('sample.wav')[0].squeeze()], model))

# <|0.06| What are the many parts that make a machine learning system feel like it works so magically cheap? |5.86|>
# <|5.68| Explletability factors important, so they tend to gear towards more simpler models with less parameters, but easier to explain, and on the other spectrum there are |15.86|>

📄 License

This project is licensed under the CC BY-NC 4.0 license.

📚 Citation

This repository uses the idea from the following paper. Please cite the paper if this model is used to help produce published results or is incorporated into other software.

@INPROCEEDINGS{10446589,
  author={Nguyen, Thai-Binh and Waibel, Alexander},
  booktitle={ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, 
  title={Synthetic Conversations Improve Multi-Talker ASR}, 
  year={2024},
  volume={},
  number={},
  pages={10461-10465},
  keywords={Systematics;Error analysis;Knowledge based systems;Oral communication;Signal processing;Data models;Acoustics;multi-talker;asr;synthetic conversation},
  doi={10.1109/ICASSP48485.2024.10446589}
}

📞 Contact

You can contact the author via email: nguyenvulebinh@gmail.com

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご