asr_hubert_cluster_bart_base Open-Source Automatic Speech Recognition Model - Supports Efficient Voice-to-Text Conversion

Asr Hubert Cluster Bart Base

Developed by voidful

An automatic speech recognition model built on Hubert and BART: Hubert speech features are quantized into k-means cluster ids, and BART translates the resulting id sequence into text

Speech Recognition

Transformers

Supports Multiple Languages

Open Source License: Apache-2.0

Tags: #Speech-to-Text #Hubert Feature Clustering #BART Sequence Generation

Downloads 13

Release Time: 3/2/2022

Model Overview

This model combines Hubert's speech feature extraction capability with BART's sequence-to-sequence transformation ability, specifically designed for automatic speech recognition (ASR) tasks.

Model Features

Hubert Feature Clustering

Uses Hubert to extract speech features and encodes them through k-means clustering
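
The clustering step can be illustrated without any model weights. The following is a minimal, dependency-free sketch, assuming frame features and centroids are plain lists of floats; the function name `assign_clusters` and the toy values are hypothetical and not part of this project. It uses the expanded squared-distance form ||x||^2 - 2 x.c + ||c||^2 and picks the closest centroid for each frame.

```python
def assign_clusters(frames, centroids):
    """Return, for each frame, the index of the nearest centroid.

    Illustrative only: `frames` stands in for Hubert hidden states and
    `centroids` for the k-means cluster centers loaded from disk.
    """
    # Squared norms of the centroids are computed once and reused.
    centroid_norms = [sum(v * v for v in c) for c in centroids]
    codes = []
    for x in frames:
        x_norm = sum(v * v for v in x)
        # ||x - c||^2 = ||x||^2 - 2 * x.c + ||c||^2
        dists = [
            x_norm - 2 * sum(a * b for a, b in zip(x, c)) + c_norm
            for c, c_norm in zip(centroids, centroid_norms)
        ]
        codes.append(dists.index(min(dists)))
    return codes


# Toy 2-D example with three hypothetical centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.4], [1.1, 0.8]]
codes = assign_clusters(frames, centroids)
print(codes)  # [0, 1, 2, 1]
```

Each speech frame is thereby reduced to a single integer, which is what lets a text model such as BART consume audio as a token sequence.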

BART Sequence Transformation

Utilizes the BART model to convert clustered feature sequences into text sequences
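
To make the hand-off between the two stages concrete: the usage code on this page renders each cluster id as a `:vtok<id>:` token and concatenates the tokens into one string for the BART tokenizer. A small sketch of that formatting step follows. The helper names `codes_to_text` and `text_to_codes` are hypothetical, and the parser exists only to sanity-check the format.

```python
import re


def codes_to_text(codes):
    """Join cluster ids into the ':vtok<id>:' string format used on this page."""
    return "".join(f":vtok{i}:" for i in codes)


def text_to_codes(text):
    """Inverse of codes_to_text, used here only to sanity-check the format."""
    return [int(m) for m in re.findall(r":vtok(\d+):", text)]


example = codes_to_text([17, 17, 42, 3])
print(example)  # :vtok17::vtok17::vtok42::vtok3:
```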

Efficient Speech Processing

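Reads audio files directly and converts them into text; the Hubert feature extractor expects input at the sample rate the checkpoint was trained on (16 kHz for facebook/hubert-large-ll60k), so audio at other rates needs to be resampled first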

Model Capabilities

English Speech Recognition

Speech Feature Extraction

Sequence-to-Text Conversion

Use Cases

Speech Transcription

Lecture Transcription

Convert recorded lectures into written transcripts

Example result: 'going along slushy country roads and speaking to damp audience in drifty school rooms day after day for a fortnight...'

Voice Assistants

Voice Command Recognition

Transcribe spoken user commands into text that downstream software can map to actions

🚀 voidful/asr_hubert_cluster_bart_base

This project performs automatic speech recognition using Hubert k-means clustering and a BART-based model, using the Librispeech dataset.

🚀 Quick Start

Download Necessary Files

wget https://raw.githubusercontent.com/voidful/hubert-cluster-code/main/km_feat_100_layer_20
wget https://cdn-media.huggingface.co/speech_samples/sample1.flac

💻 Usage Examples

Basic Usage

Generate Hubert K-means Codes

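import joblib
import soundfile as sf
import torch
from transformers import Wav2Vec2FeatureExtractor, HubertModel


class HubertCode(object):
    """Map an audio file to a sequence of k-means cluster ids over Hubert features."""

    def __init__(self, hubert_model, km_path, km_layer):
        self.processor = Wav2Vec2FeatureExtractor.from_pretrained(hubert_model)
        # eval() disables dropout so the extracted features are deterministic.
        self.model = HubertModel.from_pretrained(hubert_model).eval()
        self.km_model = joblib.load(km_path)
        self.km_layer = km_layer
        # Centroids as a (feature_dim, n_clusters) matrix, plus their squared norms.
        self.C_np = self.km_model.cluster_centers_.transpose()
        self.Cnorm_np = (self.C_np ** 2).sum(0, keepdims=True)

        # Cast to float32 so the centroids match the dtype of the Hubert hidden states.
        self.C = torch.from_numpy(self.C_np).float()
        self.Cnorm = torch.from_numpy(self.Cnorm_np).float()
        if torch.cuda.is_available():
            self.C = self.C.cuda()
            self.Cnorm = self.Cnorm.cuda()
            self.model = self.model.cuda()

    @torch.no_grad()  # inference only, no gradients needed
    def __call__(self, filepath):
        # The feature extractor rejects audio whose sample rate differs from the
        # one it was configured with (16 kHz for this checkpoint).
        speech, sr = sf.read(filepath)
        input_values = self.processor(speech, return_tensors="pt", sampling_rate=sr).input_values
        if torch.cuda.is_available():
            input_values = input_values.cuda()
        hidden_states = self.model(input_values, output_hidden_states=True).hidden_states
        # Shape (1, frames, feature_dim): drop only the batch axis.
        x = hidden_states[self.km_layer].squeeze(0)
        # Squared distance to every centroid: ||x||^2 - 2 x.C + ||C||^2
        dist = (
                x.pow(2).sum(1, keepdim=True)
                - 2 * torch.matmul(x, self.C)
                + self.Cnorm
        )
        # One cluster id per frame.
        return dist.argmin(dim=1).cpu().numpy()


hc = HubertCode("facebook/hubert-large-ll60k", './km_feat_100_layer_20', 20)
voice_ids = hc('./sample1.flac')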
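
The length of `voice_ids` grows with the audio duration, which matters because the codes are later fed to a BART model. As a rough sketch, assuming the default Hubert convolutional feature encoder (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2, roughly one frame per 20 ms at 16 kHz), the number of codes for a clip can be estimated as below. The function name is hypothetical, and the values should be checked against the checkpoint's own config.

```python
# Assumed defaults of the Hubert convolutional feature encoder; verify against
# the `conv_kernel` and `conv_stride` fields of the checkpoint's config.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def estimate_num_codes(num_samples):
    """Estimate how many frames (and hence k-means codes) a waveform yields."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return max(length, 0)


one_second = estimate_num_codes(16_000)
twenty_seconds = estimate_num_codes(20 * 16_000)
print(one_second, twenty_seconds)  # 49 999
```

Under these assumptions a 20-second clip yields about a thousand codes. If each code maps to one tokenizer token, that is close to the 1024-position limit of BART-base, so longer recordings would likely need to be split before transcription.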

Load the BART Model

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("voidful/asr_hubert_cluster_bart_base")
model = AutoModelForSeq2SeqLM.from_pretrained("voidful/asr_hubert_cluster_bart_base")

Generate Output

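# Render each cluster id as a ":vtok<id>:" token and join them into one input string.
voice_tokens = "".join([f":vtok{i}:" for i in voice_ids])
input_ids = tokenizer(voice_tokens, return_tensors='pt').input_ids
gen_output = model.generate(input_ids=input_ids, max_length=1024)
print(tokenizer.decode(gen_output[0], skip_special_tokens=True))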

📚 Documentation

Result

Running the steps above on sample1.flac produces the following transcription:

going along slushy country roads and speaking to damp audience in drifty school rooms day after day for a fortnight he'll have to put in an appearance at some place of worship on sunday morning and he can come to ask immediately afterwards

📄 License

This project is licensed under the Apache-2.0 license.

Property	Details
Datasets	Librispeech
Tags	audio, automatic-speech-recognition, speech, asr, hubert
License	apache-2.0
Metrics	wer, cer
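
The metrics listed above are word error rate (WER) and character error rate (CER). For reference, a minimal sketch of how both are commonly computed from a Levenshtein edit distance is shown here. The function names and example strings are illustrative, the reference is assumed to be non-empty, and this is not the evaluation script used for this model.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sat on mat"
print(wer(reference, hypothesis))  # one dropped word out of six
print(cer(reference, hypothesis))  # four dropped characters out of 22
```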
