wav2vec2-base-timit-asr Open-source Speech Recognition Model - Supports 16kHz Speech Input, Precise Recognition

Wav2vec2 Base Timit Asr

Developed by elgeish

A speech recognition model fine-tuned from facebook/wav2vec2-base on the timit_asr dataset. It expects audio input sampled at 16kHz.

Speech Recognition

Transformers

English

Open Source License: Apache-2.0

#English Speech Recognition #TIMIT Dataset #No Language Model

Downloads 174

Release Time: 3/2/2022

Model Overview

This is an automatic speech recognition (ASR) model fine-tuned on the TIMIT dataset. It converts English speech to text.

Model Features

No Language Model Required

This model can be used directly, without an additional language model.

16kHz Sampling Rate Support

Expects audio input sampled at 16kHz; audio recorded at another rate should be resampled first.

TIMIT Dataset Optimization

Fine-tuned on the TIMIT ASR dataset (timit_asr).

Model Capabilities

English Speech Recognition

Speech-to-Text

Automatic Speech Transcription

Use Cases

Speech Transcription

Speech to Text

Convert English speech to text format

In the output example further down, several utterances are transcribed exactly, while others contain word-level errors such as "puder" for "pewter" or "marrage" for "marriage".

Speech Analysis

Speech Content Analysis

Analyze speech content to extract key information

🚀 Wav2Vec2-Base-TIMIT

This project fine-tunes the facebook/wav2vec2-base model on the timit_asr dataset. When using this model, ensure that your speech input is sampled at 16kHz.

🚀 Quick Start

This model is fine-tuned from facebook/wav2vec2-base on the timit_asr dataset. Make sure your speech input is sampled at 16kHz before passing it to the model.
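If a recording is not already at 16kHz, it has to be resampled before it reaches the processor. The following is only a minimal sketch of that step, using plain linear interpolation in pure Python. The helper name `resample_linear` is hypothetical and not part of the model card, and the method applies no anti-aliasing filter, so a dedicated resampler from an audio library is likely a better choice for real recordings.

```python
def resample_linear(samples, src_rate, dst_rate=16_000):
    """Resample a mono signal by linear interpolation.

    Hypothetical helper for illustration only: it does no anti-alias
    filtering, so downsampling real speech with it may add artifacts.
    """
    if src_rate <= 0 or dst_rate <= 0:
        raise ValueError("sample rates must be positive")
    if src_rate == dst_rate or not samples:
        return list(samples)

    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # distance between output samples, in input samples
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # input sample at or before pos
        hi = min(lo + 1, last)     # next input sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out


# A short ramp "recorded" at 8kHz, brought up to the 16kHz the model expects.
ramp_8k = [0.0, 1.0, 2.0, 3.0]
ramp_16k = resample_linear(ramp_8k, src_rate=8_000)
print(len(ramp_16k), ramp_16k)
```

The resampled list can then be passed to the processor with `sampling_rate=16000`, in the same way the usage example passes `dataset["speech"]`.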

✨ Features

Fine-tuned on the timit_asr dataset for automatic speech recognition.
Can be used directly without a language model.
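Running without a language model means the transcript comes straight from the per-frame predictions of the CTC head: take the most likely token for each frame, collapse consecutive repeats, and drop the blank token. The sketch below illustrates that greedy decoding rule on a toy example. The function `ctc_greedy_decode`, the seven-entry vocabulary, and the frame sequence are all made up for illustration; the real vocabulary and decoding logic live in the model's tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0, word_delimiter="|"):
    """Greedy CTC decoding: collapse repeated ids, then drop blanks.

    Illustrative sketch only; vocabulary and ids are assumptions,
    not the actual tokenizer contents.
    """
    chars = []
    previous = None
    for token_id in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if token_id != previous and token_id != blank_id:
            chars.append(id_to_char[token_id])
        previous = token_id
    text = "".join(chars).replace(word_delimiter, " ")
    return " ".join(text.split())  # normalize whitespace


# Toy vocabulary (hypothetical): id 0 plays the role of the CTC blank.
toy_vocab = {0: "<pad>", 1: "|", 2: "a", 3: "l", 4: "y", 5: "e", 6: "r"}

# Per-frame argmax ids; the blank between the two "l" runs keeps both letters.
frames = [2, 2, 0, 3, 3, 0, 3, 1, 1, 4, 5, 5, 0, 2, 6, 6]
print(ctc_greedy_decode(frames, toy_vocab))
```

Because nothing constrains the output to real words, misspellings of the kind seen in the output example ("puder", "marrage") can appear.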

📦 Installation

The original model card lists no installation steps. The usage example below imports soundfile, torch, datasets, and transformers, so those Python packages need to be installed first.

💻 Usage Examples

Basic Usage

import soundfile as sf
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "elgeish/wav2vec2-base-timit-asr"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

dataset = load_dataset("timit_asr", split="test").shuffle().select(range(10))
char_translations = str.maketrans({"-": " ", ",": "", ".": "", "?": ""})

def prepare_example(example):
    example["speech"], _ = sf.read(example["file"])
    example["text"] = example["text"].translate(char_translations)
    example["text"] = " ".join(example["text"].split())  # clean up whitespaces
    example["text"] = example["text"].lower()
    return example

dataset = dataset.map(prepare_example, remove_columns=["file"])
inputs = processor(dataset["speech"], sampling_rate=16000, return_tensors="pt", padding="longest")

with torch.no_grad():
    predicted_ids = torch.argmax(model(inputs.input_values).logits, dim=-1)
predicted_ids[predicted_ids == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
predicted_transcripts = processor.tokenizer.batch_decode(predicted_ids)

for reference, predicted in zip(dataset["text"], predicted_transcripts):
    print("reference:", reference)
    print("predicted:", predicted)
    print("--")

Output Example

reference: she had your dark suit in greasy wash water all year
predicted: she had your dark suit in greasy wash water all year
--
reference: where were you while we were away
predicted: where were you while we were away
--
reference: cory and trish played tag with beach balls for hours
predicted: tcory and trish played tag with beach balls for hours
--
reference: tradition requires parental approval for under age marriage
predicted: tradition requires parrental proval for under age marrage
--
reference: objects made of pewter are beautiful
predicted: objects made of puder are bautiful
--
reference: don't ask me to carry an oily rag like that
predicted: don't o ask me to carry an oily rag like that
--
reference: cory and trish played tag with beach balls for hours
predicted: cory and trish played tag with beach balls for ours
--
reference: don't ask me to carry an oily rag like that
predicted: don't ask me to carry an oily rag like that
--
reference: don't do charlie's dirty dishes
predicted: don't  do chawly's tirty dishes
--
reference: only those story tellers will remain who can imitate the style of the virtuous
predicted: only those story tillaers will remain who can imvitate the style the virtuous
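
One way to put a number on the errors in these transcripts is the word error rate (WER): the word-level edit distance between reference and prediction, divided by the number of reference words. The model card itself reports no WER, so the sketch below is only an assumption about how such a comparison could be scored; the function name `word_error_rate` is hypothetical, and the sentence pairs are copied from the output above.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


pairs = [
    ("where were you while we were away",
     "where were you while we were away"),
    ("objects made of pewter are beautiful",
     "objects made of puder are bautiful"),
    ("don't ask me to carry an oily rag like that",
     "don't o ask me to carry an oily rag like that"),
]
for reference, predicted in pairs:
    print(f"{word_error_rate(reference, predicted):.2f}", "|", predicted)
```

In the second pair, two of the six reference words are wrong, giving a WER of 2/6; in the third, one inserted word against ten reference words gives 1/10.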

📚 Documentation

You can find the script used to produce this model here.

📄 License

This model is licensed under the Apache 2.0 license.
