🚀 KBLab's wav2vec 2.0 large VoxRex Swedish (C) with 4-gram model
This model extends KBLab's VoxRex-C acoustic model with a 4-gram language model for improved automatic speech recognition of Swedish.
✨ Features
- Language Support: Designed specifically for Swedish automatic speech recognition.
- Enhanced Performance: Extended with a 4-gram language model to improve recognition accuracy.
- Multiple Datasets: The acoustic model is trained on Common Voice, the NST Swedish ASR Database, and P4; the language model is estimated from The Swedish Culturomics Gigaword Corpus.
📦 Installation
The original card does not list explicit installation steps.
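As a suggested setup (not from the original card), the packages below are inferred from the imports in the usage examples; pyctcdecode and kenlm are required by Wav2Vec2ProcessorWithLM for language-model decoding:

```bash
# Suggested setup, inferred from the usage examples below
pip install transformers torch torchaudio datasets
# Required for n-gram decoding with Wav2Vec2ProcessorWithLM
pip install pyctcdecode kenlm
```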
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Pass the device to the pipeline constructor; a pipeline object has no .to() method.
# The ASR pipeline loads the audio file and resamples it to 16 kHz internally.
pipe = pipeline(model=model_name, device=device)
output = pipe('path/to/audio.mp3')['text']
```
Advanced Usage
```python
import torch
import torchaudio.functional as F
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_name = 'viktor-enzell/wav2vec2-large-voxrex-swedish-4gram'
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Wav2Vec2ForCTC.from_pretrained(model_name).to(device)
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_name)

# Load 1% of the Swedish Common Voice test split.
common_voice = load_dataset('common_voice', 'sv-SE', split='test[:1%]')

def speech_file_to_array(sample):
    # Resample each utterance to the 16 kHz rate expected by the model.
    sampling_rate = sample['audio']['sampling_rate']
    sample['speech'] = F.resample(torch.tensor(sample['audio']['array']), sampling_rate, 16_000)
    return sample

common_voice = common_voice.map(speech_file_to_array)

inputs = processor(common_voice['speech'], sampling_rate=16_000, return_tensors='pt', padding=True).to(device)
with torch.no_grad():
    logits = model(**inputs).logits

# batch_decode runs CTC decoding with the 4-gram language model.
transcripts = processor.batch_decode(logits.cpu().numpy()).text
```
📚 Documentation
Model Description
VoxRex-C is extended with a 4-gram language model estimated from a subset extracted from The Swedish Culturomics Gigaword Corpus from Språkbanken. The subset contains 40M words from the social media genre between 2010 and 2015.
Training Procedure
Text data for the n-gram model is pre-processed by removing characters not part of the wav2vec 2.0 vocabulary and uppercasing all characters. After pre-processing and storing each text sample on a new line in a text file, a KenLM model is estimated. See the Hugging Face tutorial on boosting wav2vec 2.0 with n-grams for more details.
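As a minimal sketch of this pre-processing step (not from the original card), assuming `sentences` is an iterable of raw text samples and that the wav2vec 2.0 vocabulary covers the uppercase Swedish alphabet, the apostrophe, and the space character:

```python
import re

# Characters assumed to lie outside the wav2vec 2.0 vocabulary
# (the exact character set is an assumption, not taken from the card).
OOV_CHARS = re.compile(r"[^A-ZÅÄÖ' ]")

with open('lm_corpus.txt', 'w', encoding='utf-8') as f:
    for sentence in sentences:
        # Uppercase first, then drop everything outside the vocabulary.
        cleaned = OOV_CHARS.sub('', sentence.upper())
        f.write(cleaned + '\n')  # one pre-processed sample per line
```

The resulting file can then be fed to KenLM's `lmplz` binary to estimate the 4-gram model, e.g. `lmplz -o 4 < lm_corpus.txt > 4gram.arpa`.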
Evaluation Results
The model was evaluated on the full Common Voice test set (version 6.1). VoxRex-C achieved a WER of 9.03% without the language model and 6.47% with it.
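A sketch of how such a score could be computed from the advanced example above, using the third-party jiwer package (an assumption; the original card does not say which WER implementation was used):

```python
from jiwer import wer  # pip install jiwer

# Common Voice stores reference transcripts in the 'sentence' column.
# Uppercase them to match the model's uppercase output; a full evaluation
# would also normalize punctuation.
references = [s.upper() for s in common_voice['sentence']]
print(f'WER: {wer(references, transcripts):.2%}')
```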
🔧 Technical Details
- Metrics: Word Error Rate (WER) is used as the evaluation metric.
- Tags: audio, automatic-speech-recognition, speech.
- License: CC0-1.0.
- Datasets: Trained on multiple datasets, including Common Voice, NST Swedish ASR Database, P4, and The Swedish Culturomics Gigaword Corpus.
| Property | Details |
|----------|---------|
| Model Type | wav2vec 2.0 large VoxRex Swedish (C) with 4-gram |
| Training Data | Common Voice, NST Swedish ASR Database, P4, The Swedish Culturomics Gigaword Corpus |
📄 License
This model is released under the CC0-1.0 license.