🚀 KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model
This model extends KBLab's VoxRex-C acoustic model with a 4-gram language model for improved automatic speech recognition of Swedish.
✨ Features
- Language Support: Designed specifically for Swedish automatic speech recognition.
- Enhanced Performance: Extended with a 4-gram language model to improve recognition accuracy.
- Multiple Datasets: The acoustic model is trained on Common Voice, the NST Swedish ASR Database, and P4; the language model is estimated from The Swedish Culturomics Gigaword Corpus.
📦 Installation
The original card does not list explicit installation steps.
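As a suggested setup (not from the original card), the packages below are inferred from the imports in the usage examples; pyctcdecode and kenlm are required by Wav2Vec2ProcessorWithLM for language-model decoding:

```bash
# Suggested setup, inferred from the usage examples below
pip install transformers torch torchaudio datasets
# Required for n-gram decoding with Wav2Vec2ProcessorWithLM
pip install pyctcdecode kenlm
```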
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Pass the device to the pipeline constructor; a pipeline object has no .to() method.
# The ASR pipeline loads the audio file and resamples it to 16 kHz internally.
pipe = pipeline(model=model_name, device=device)
output = pipe('path/to/audio.mp3')['text']
```
Advanced Usage
```python
import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load 1% of the Swedish Common Voice test split.
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

def speech_file_to_array(sample):
    # Resample each utterance to the 16 kHz rate expected by the model.
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample

common_voice = common_voice.map(speech_file_to_array)

inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs CTC decoding with the 4-gram language model.
transcripts = processor.batch_decode(logits.cpu().numpy()).text
```
📚 Documentation
Model Description
VoxRex-C is extended with a 4-gram language model estimated from a subset extracted from The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words from the social media genre between 2010 and 2015.
Training Procedure
Text data for the n-gram model is pre-processed by removing characters not part of the wav2vec 2.0 vocabulary and uppercasing all characters. After pre-processing and storing each text sample on a new line in a text file, a KenLM model is estimated. See the Hugging Face tutorial on boosting wav2vec 2.0 with n-grams for more details.
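As a minimal sketch of this pre-processing step (not from the original card), assuming `sentences` is an iterable of raw text samples and that the wav2vec 2.0 vocabulary covers the uppercase Swedish alphabet, the apostrophe, and the space character:

```python
import re

# Characters assumed to lie outside the wav2vec 2.0 vocabulary
# (the exact character set is an assumption, not taken from the card).
OOV_CHARS = re.compile(r"[^A-ZÅÄÖ' ]")

with open('lm_corpus.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences:
        # Uppercase first, then drop everything outside the vocabulary.
        cleaned = OOV_CHARS.sub('', sentence.upper())
        f.write(cleaned + '\n')  # one pre-processed sample per line
```

The resulting file can then be fed to KenLM's `lmplz` binary to estimate the 4-gram model, e.g. `lmplz -o 4 < lm_corpus.txt > 4gram.arpa`.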
Evaluation Results
The model was evaluated on the full Common Voice test set (version 6.1). VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with it.
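A sketch of how such a score could be computed from the advanced example above, using the third-party jiwer package (an assumption; the original card does not say which WER implementation was used):

```python
from jiwer import wer  # pip install jiwer

# Common Voice stores reference transcripts in the 'sentence' column.
# Uppercase them to match the model's uppercase output; a full evaluation
# would also normalize punctuation.
references = [s.upper() for s in common_voice['sentence']]
print(f'WER: {wer(references, transcripts):.2%}')
```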
🔧 Technical Details
- Metrics: Word Error Rate (WER) is used as the evaluation metric.
- Tags: audio, automatic-speech-recognition, speech.
- License: CC0-1.0.
- Datasets: Trained on multiple datasets, including Common Voice, NST Swedish ASR Database, P4, and The Swedish Culturomics Gigaword Corpus.
| Property | Details |
|----------|---------|
| Model Type | wav2vec 2.0 large VoxRex Swedish (C) with 4-gram |
| Training Data | Common Voice, NST Swedish ASR Database, P4, The Swedish Culturomics Gigaword Corpus |
📄 License
This model is released under the CC0-1.0 license.