🚀 Wav2Vec2-Large-Tedlium
A Wav2Vec2 large model fine-tuned on the TEDLIUM corpus for speech recognition.
This model is initialized from Facebook's Wav2Vec2 large LV-60k checkpoint, which was pre-trained on 60,000 hours of audiobooks from the LibriVox project. It was then fine-tuned on 452 hours of TED talks from the TEDLIUM corpus (Release 3). When using the model, make sure your speech input is sampled at 16 kHz.
The model achieves a word error rate (WER) of 8.4% on the dev set and 8.2% on the test set. The training logs document the training and evaluation progress over 50k steps of fine-tuning.
For more information on how this model was fine-tuned, see this notebook.
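To make the reported metric concrete, here is a minimal from-scratch sketch of word error rate: the word-level edit distance (substitutions, deletions and insertions) between a reference and a hypothesis, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions. The evaluation code in this card uses the `jiwer` package instead, and the reported figures may also depend on text normalization that this sketch does not attempt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the")
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {score:.1%}")  # 2 errors / 6 reference words -> WER: 33.3%
```

A WER of 8.2% therefore corresponds to roughly 8 word-level errors per 100 reference words.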
🚀 Quick Start
Prerequisites
- Ensure your speech input is sampled at 16 kHz.
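One way to verify this prerequisite before running the model is to read the sampling rate from the audio file header. The sketch below uses only Python's standard `wave` module, so it assumes uncompressed PCM WAV input. The helper names `wav_sample_rate` and `needs_resampling` are hypothetical and not part of this model's tooling.

```python
import wave

TARGET_SAMPLE_RATE = 16_000  # the rate this model expects, in Hz


def wav_sample_rate(path: str) -> int:
    """Read the sampling rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path: str, target_rate: int = TARGET_SAMPLE_RATE) -> bool:
    """Return True if the file is not already at the target sampling rate."""
    return wav_sample_rate(path) != target_rate
```

If a file turns out to be at a different rate, it can be resampled before inference with an audio library, for example `librosa.resample` or `torchaudio.functional.resample`.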
Transcribing Audio Files
The model can be used as a standalone acoustic model to transcribe audio files.
💻 Usage Examples
Basic Usage
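import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the processor (feature extractor + tokenizer) and the fine-tuned model
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

# Take one sample from a small dummy split of TEDLIUM
ds = load_dataset("sanchit-gandhi/tedlium_dummy", split="validation")
sample = ds[0]

# Convert the waveform to model inputs; passing the sampling rate lets the processor report a mismatch
input_values = processor(sample["audio"]["array"], sampling_rate=sample["audio"]["sampling_rate"], return_tensors="pt", padding="longest").input_values
with torch.no_grad():  # inference only, no gradients needed
    logits = model(input_values).logits

# Greedy decoding: most likely token per frame, then decode to text
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print("Target: ", sample["text"])
print("Transcription: ", transcription[0])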
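The `argmax` followed by `batch_decode` in the example above amounts to greedy CTC decoding: pick the most likely token for each frame, merge consecutive repeats, then drop the blank token. The sketch below illustrates that collapse rule on plain Python lists. The toy vocabulary, the blank index `0` and the function name `ctc_greedy_collapse` are assumptions for illustration only; the model's real vocabulary and special tokens are defined by its tokenizer.

```python
from itertools import groupby

BLANK_ID = 0  # assumed blank index for this toy example
TOY_VOCAB = {0: "<blank>", 1: "h", 2: "e", 3: "l", 4: "o"}  # hypothetical vocabulary


def ctc_greedy_collapse(frame_ids, blank_id=BLANK_ID):
    """Merge consecutive duplicate ids, then remove blanks."""
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    return [token_id for token_id in merged if token_id != blank_id]


# Per-frame argmax ids; the blank between the two "l" runs keeps the double letter
frame_ids = [1, 1, 0, 2, 2, 3, 0, 3, 3, 4, 0]
token_ids = ctc_greedy_collapse(frame_ids)
print("".join(TOY_VOCAB[t] for t in token_ids))  # hello
```

Because repeats are merged before blanks are removed, a blank frame between two identical tokens is what allows genuinely doubled characters to survive decoding.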
Advanced Usage
Evaluation
The following code snippet shows how to evaluate Wav2Vec2-Large-Tedlium on the TEDLIUM test data.
import torch
from datasets import load_dataset
from jiwer import wer
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

tedlium_eval = load_dataset("LIUM/tedlium", "release3", split="test")
model = Wav2Vec2ForCTC.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium").to("cuda")
processor = Wav2Vec2Processor.from_pretrained("sanchit-gandhi/wav2vec2-large-tedlium")

def map_to_pred(example):
    # Transcribe a single example with greedy decoding
    audio = example["audio"]
    input_values = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to("cuda")).logits
    predicted_ids = torch.argmax(logits, dim=-1)
    example["transcription"] = processor.batch_decode(predicted_ids)[0]
    return example

# Process one example at a time and drop the decoded audio column afterwards
result = tedlium_eval.map(map_to_pred, remove_columns=["audio"])
print("WER:", wer(result["text"], result["transcription"]))
📄 License
This project is licensed under the Apache-2.0 license.
| Property | Details |
|---|---|
| Model Type | Wav2Vec2 large model fine-tuned on the TEDLIUM corpus |
| Training Data | 452 h of TED talks from the TEDLIUM corpus (Release 3) |
| Tags | speech |
| Datasets | LIUM/tedlium |