Wav2vec 2.0 large-voxpopuli-sv-swedish
An additionally pretrained and finetuned model for Swedish automatic speech recognition, leveraging Swedish radio broadcasts, NST, and Common Voice data.
Important Note
this model performs better and has a less restrictive license.
This is an additionally pretrained and finetuned version of Facebook's VoxPopuli-sv large model, trained on Swedish radio broadcasts, NST, and Common Voice data. Without a language model, the WER is 3.95% on the NST + Common Voice test set (2% of total sentences) and 10.99% on the Common Voice test set. With a 4-gram language model, the Common Voice WER drops to 7.82%.
Usage Tip
When using this model, make sure that your speech input is sampled at 16 kHz.
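The 16 kHz requirement can be checked before running inference. The following is a minimal sketch that only handles uncompressed PCM WAV files, using Python's standard wave module; the helper names wav_sample_rate and needs_resampling are hypothetical and not part of the model's tooling. Compressed formats such as MP3 would need a different reader.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate the model expects, per the usage tip


def wav_sample_rate(path):
    """Return the sampling rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected rate."""
    return wav_sample_rate(path) != expected_rate
```

If needs_resampling returns True, the audio should be resampled to 16 kHz before it is passed to the processor.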
Quick Start
Features
- Trained on multiple Swedish datasets including Common Voice and NST Swedish ASR Database.
- Evaluated using metrics like WER and CER.
- Can be used for automatic speech recognition tasks.
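For readers unfamiliar with the metrics, WER and CER are edit-distance ratios computed over words and characters respectively. The sketch below is a plain-Python illustration of the general definition, not the evaluation script used for this model. It assumes corpus-level scoring (total edits divided by total reference length) and applies no text normalization, neither of which is specified in the source. The example sentences are made up.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion of a reference item
                current[j - 1] + 1,      # insertion of a hypothesis item
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(references, hypotheses):
    """Word error rate: total word edits / total reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total_words = sum(len(r.split()) for r in references)
    return errors / total_words


def cer(references, hypotheses):
    """Character error rate: total character edits / total reference characters."""
    errors = sum(edit_distance(list(r), list(h))
                 for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return errors / total_chars


# Illustrative (made-up) reference and hypothesis
refs = ["jag heter anna"]
hyps = ["jag heter hanna"]
print("WER:", wer(refs, hyps))
print("CER:", cer(refs, hyps))
```

A single wrong word in a three-word sentence yields a high WER but a low CER, which is why the two metrics are usually reported together.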
Installation
No installation steps are specified. The usage example below imports torch, torchaudio, datasets, and transformers, so those packages need to be available.
Usage Examples
Basic Usage
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice (2%) of the Swedish Common Voice test split.
test_dataset = load_dataset("common_voice", "sv-SE", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("KBLab/wav2vec2-large-voxpopuli-sv-swedish")
model = Wav2Vec2ForCTC.from_pretrained("KBLab/wav2vec2-large-voxpopuli-sv-swedish")

# Resample from 48 kHz (the input rate assumed here) to the 16 kHz the model expects.
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file and store it as a resampled NumPy array.
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Run inference without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: pick the most likely token for each frame.
predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
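The argmax and batch_decode steps in the example above amount to greedy CTC decoding: take the most likely token per frame, collapse consecutive repeats, drop the blank token, and join the rest into text. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank ID, and the word delimiter are assumptions made for illustration; the real tokenizer ships its own vocabulary and handles more cases.

```python
from itertools import groupby

# Hypothetical toy vocabulary, for illustration only.
VOCAB = {0: "<pad>", 1: "|", 2: "h", 3: "e", 4: "j"}
BLANK_ID = 0          # assumption: the padding token acts as the CTC blank
WORD_DELIMITER = "|"  # assumption: symbol used to mark word boundaries


def greedy_ctc_decode(frame_ids, vocab=VOCAB, blank_id=BLANK_ID):
    """Collapse repeated frame predictions, drop blanks, and join into text."""
    # Step 1: merge runs of identical consecutive IDs into a single ID.
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # Step 2: remove blanks, which only separate repeated characters.
    tokens = [vocab[token_id] for token_id in collapsed if token_id != blank_id]
    # Step 3: turn word delimiters into spaces.
    return "".join(tokens).replace(WORD_DELIMITER, " ").strip()


# Per-frame argmax output for a short utterance (made-up values)
print(greedy_ctc_decode([2, 2, 0, 3, 3, 3, 0, 0, 4, 4]))
```

Because blanks are removed only after repeats are collapsed, a blank between two identical IDs keeps them as separate characters, which is how CTC represents double letters.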
Technical Details
This model has been additionally pretrained on 1000 h of Swedish local radio broadcasts, fine-tuned for 120,000 updates on NST + CommonVoice, and then fine-tuned for an additional 20,000 updates on CommonVoice only. The additional fine-tuning on CommonVoice slightly hurts performance on the NST + CommonVoice test set and, unsurprisingly, improves it on the CommonVoice test set. Overall it seems to perform better, though [citation needed].
License
This model is licensed under cc-by-nc-4.0.
Documentation
| Property | Details |
| --- | --- |
| Model Type | Wav2vec 2.0 large VoxPopuli-sv Swedish |
| Training Data | 1000 h of Swedish local radio broadcasts, NST + CommonVoice data |
| Datasets | common_voice, NST Swedish ASR Database |
| Metrics | wer, cer |
| Tags | audio, automatic-speech-recognition, speech, voxpopuli |
| Results | Task: Speech Recognition (automatic-speech-recognition); Dataset: Common Voice (common_voice, sv-SE); Test WER: 10.994764; Test CER: 3.946846 |