wav2vec2-base-da-ft-nst
This is a Danish automatic speech recognition (ASR) model based on the Alvenir wav2vec2 model and fine-tuned by Alvenir on the public NST dataset. It offers high-quality speech-to-text conversion for Danish.
Quick Start
This model was trained on 16 kHz audio, so make sure your input audio has the same sample rate. It was initially trained using fairseq and then converted to the huggingface/transformers format.
Alvenir is always happy to help with your open-source ASR projects, customized domain specializations, or premium models. ;-)
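Because the model expects 16 kHz input, it can help to bring audio recorded at another rate to 16 kHz before transcribing it. The following is a minimal sketch and not part of the original README: `resample_linear` is a hypothetical helper that resamples a mono signal by linear interpolation using only the standard library. It applies no anti-aliasing filter, so a dedicated resampler from an audio library is likely a better choice for real use.

```python
TARGET_SAMPLE_RATE = 16_000  # sample rate the model was trained on


def resample_linear(samples, source_rate, target_rate=TARGET_SAMPLE_RATE):
    """Resample a mono signal (sequence of floats) by linear interpolation.

    Hypothetical helper for illustration only: no anti-aliasing filter is
    applied, so quality is lower than that of a proper audio resampler.
    """
    if source_rate <= 0 or target_rate <= 0:
        raise ValueError("sample rates must be positive")
    samples = list(samples)
    if source_rate == target_rate or len(samples) < 2:
        return samples

    # Number of output samples that cover the same duration.
    out_len = max(1, round(len(samples) * target_rate / source_rate))
    step = source_rate / target_rate  # input samples advanced per output sample

    resampled = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), len(samples) - 1)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        resampled.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return resampled


# One second of a 48 kHz ramp becomes one second at 16 kHz.
ramp_48k = [i / 48_000 for i in range(48_000)]
ramp_16k = resample_linear(ramp_48k, source_rate=48_000)
print(len(ramp_16k))  # 16000
```

The resampled list can then be passed to the processor with `sampling_rate=16_000`, as in the usage example.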
Features
- Danish ASR: Fine-tuned specifically for Danish speech-to-text tasks.
- Sample rate requirement: Trained on 16 kHz audio, so input audio should also be sampled at 16 kHz.
- Format compatibility: Converted to the huggingface/transformers format for easy integration.
Installation
The original README provides no specific installation steps. To use this model, you need to install the relevant Python libraries, such as transformers, soundfile, and torch. You can install the transformers library with the following command:
pip install transformers
Depending on your environment, you may also need to install soundfile and torch:
pip install soundfile torch
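After installing, a quick check can confirm that the three libraries are visible to your Python interpreter. This is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the check only looks packages up on the import path without importing them.

```python
from importlib.util import find_spec

# Top-level module names of the libraries mentioned above.
REQUIRED_PACKAGES = ("transformers", "soundfile", "torch")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level package names that are not found on the import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```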
Usage Examples
Basic Usage
import soundfile as sf
import torch
from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC, Wav2Vec2Processor

def get_tokenizer(model_path: str) -> Wav2Vec2CTCTokenizer:
    return Wav2Vec2CTCTokenizer.from_pretrained(model_path)

def get_processor(model_path: str) -> Wav2Vec2Processor:
    return Wav2Vec2Processor.from_pretrained(model_path)

def load_model(model_path: str) -> Wav2Vec2ForCTC:
    return Wav2Vec2ForCTC.from_pretrained(model_path)

model_id = "Alvenir/wav2vec2-base-da-ft-nst"
model = load_model(model_id)
model.eval()  # inference mode (disables dropout)
tokenizer = get_tokenizer(model_id)  # not used below; the processor bundles a tokenizer
processor = get_processor(model_id)

audio_file = "<path/to/audio.wav>"
audio, sample_rate = sf.read(audio_file)
# The model expects 16 kHz audio; resample the file first if this check fails.
assert sample_rate == 16_000, f"Expected 16 kHz audio, got {sample_rate} Hz"

input_values = processor(audio, return_tensors="pt", padding="longest", sampling_rate=16_000).input_values
with torch.no_grad():
    logits = model(input_values).logits

# Greedy decoding: pick the most likely token per frame, then decode to text.
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
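The `argmax` followed by `batch_decode` in the example above amounts to greedy CTC decoding: take the most likely token for each frame, merge consecutive repeats, and drop the blank token. The sketch below illustrates that collapse step in plain Python. The toy vocabulary and the blank index `0` are assumptions made for illustration, not the model's actual vocabulary, and `ctc_greedy_collapse` is a hypothetical helper rather than a transformers function.

```python
from itertools import groupby

# Toy vocabulary (assumption): index 0 is the CTC blank, "|" marks word boundaries.
TOY_VOCAB = ["<pad>", "|", "h", "e", "j", "d", "u"]
BLANK_ID = 0


def ctc_greedy_collapse(frame_ids, vocab=TOY_VOCAB, blank_id=BLANK_ID):
    """Collapse per-frame argmax ids into text (hypothetical helper)."""
    # 1. Merge runs of identical consecutive ids.
    merged = [token_id for token_id, _ in groupby(frame_ids)]
    # 2. Drop blanks; a blank between two equal ids keeps them as separate letters.
    tokens = [vocab[token_id] for token_id in merged if token_id != blank_id]
    # 3. Turn the word delimiter into spaces.
    return "".join(tokens).replace("|", " ").strip()


# Frames: h h <pad> e j j | | d <pad> u u
frames = [2, 2, 0, 3, 4, 4, 1, 1, 5, 0, 6, 6]
print(ctc_greedy_collapse(frames))  # hej du
```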
Documentation
Benchmark results
Here are some benchmark results on publicly available Danish datasets.
| Dataset | WER Greedy | WER with 3-gram Language Model |
|---|---|---|
| NST test | 15.8% | 11.9% |
| alvenir-asr-da-eval | 19.0% | 12.1% |
| common_voice_80 da test | 26.3% | 19.2% |
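Word error rate (WER), the metric used in the benchmark results above, is the word-level edit distance between a reference transcript and the model's hypothesis, divided by the number of reference words. The following is a minimal sketch of that calculation. The `word_error_rate` helper and the example sentences are illustrative assumptions; the benchmark numbers above were not produced with this code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the first i-1 reference words
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# Made-up example: one deleted word ("i") and one inserted word ("nu").
reference = "jeg bor i københavn"
hypothesis = "jeg bor københavn nu"
print(f"WER: {word_error_rate(reference, hypothesis):.0%}")  # WER: 50%
```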
License
This project is licensed under the Apache-2.0 license.