Whisper-large-et
This is a Whisper-large-v2 model finetuned on Estonian data, designed for general-purpose speech recognition.
🚀 Quick Start
Recommended: use faster-whisper.
For example:
- Convert the HF model to CT2 format:

```bash
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
```

- Decode:

```bash
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```
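The converted model can also be used directly from Python through the faster-whisper API. A minimal sketch, assuming the whisper-large-et.ct2 directory produced above and a CUDA-capable GPU (some_file.mp3 is just a placeholder):

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 model produced by the conversion step above;
# use device="cpu" with compute_type="int8" if no GPU is available.
model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# Same settings as the CLI example: Estonian, beam size 5.
segments, info = model.transcribe("some_file.mp3", language="et", beam_size=5)

print("Detected language:", info.language)
print(" ".join(segment.text.strip() for segment in segments))
```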
✨ Features
- This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
- It is finetuned on around 1200 hours of diverse Estonian data.
- The 2023-10-03 version of the model is trained on long segments and is well suited for "end-to-end" transcription of long speech recordings.
📦 Installation
No separate installation is needed beyond the tools used in the "Quick Start" section: ct2-transformers-converter ships with the ctranslate2 Python package (the conversion also requires transformers), and the whisper-ctranslate2 CLI is available from PyPI.
💻 Usage Examples
Basic Usage
```bash
ct2-transformers-converter --model TalTechNLP/whisper-large-et --output_dir whisper-large-et.ct2 --copy_files tokenizer.json --quantization float16
whisper-ctranslate2 --model_directory whisper-large-et.ct2 --task transcribe --language et --beam_size 5 some_file.mp3
```
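Since the checkpoint is published in Hugging Face transformers format, it can presumably also be used without CTranslate2 conversion. A sketch using the transformers ASR pipeline (the chunking and generation settings below are illustrative assumptions, not taken from the model card):

```python
from transformers import pipeline

# Load the HF checkpoint directly; chunk_length_s enables long-form
# transcription by splitting the audio into 30-second windows.
asr = pipeline(
    "automatic-speech-recognition",
    model="TalTechNLP/whisper-large-et",
    chunk_length_s=30,
)

result = asr(
    "some_file.mp3",  # placeholder file name
    generate_kwargs={"language": "et", "task": "transcribe"},
)
print(result["text"])
```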
📚 Documentation
Model description
This is a general-purpose Estonian ASR model trained in the Lab of Language Technology at TalTech.
Intended uses & limitations
This model is intended for general-purpose speech recognition, such as broadcast conversations, interviews, talks, etc.
Limitations and bias
Since this model was trained on mostly broadcast speech and texts from the web, it might have problems correctly decoding the following:
- Speech containing technical and other domain-specific terms
- Children's speech
- Non-native speech
- Speech recorded under very noisy conditions or with a microphone far from the speaker
- Very spontaneous and overlapping speech
Training data
Acoustic training data:
| Type | Amount (h) |
|---|---|
| Broadcast speech | 991 |
| Spontaneous speech | 53 |
| Elderly speech corpus | 53 |
| Talks, lectures | 49 |
| Parliament speeches | 31 |
| Total | 1161 |
Training procedure
Finetuned using Espnet, and then converted to the transformers format using this script.
The finetuning procedure is similar to that of this model.
Finetuning was done for 3 epochs, with model averaging at the end of training.
Update: the 2023-10-03 version of the model is trained on long segments (like the original Whisper model) and is therefore especially well suited to be used, e.g., with faster-whisper to transcribe long speech recordings "end-to-end" (i.e., without any prior segmentation).
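A minimal sketch of such end-to-end transcription of a long recording with faster-whisper, writing out timestamped segments (the VAD filter and file names are illustrative assumptions):

```python
from faster_whisper import WhisperModel

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# No prior segmentation: faster-whisper slides over the whole recording
# and yields segments lazily, so long files are processed incrementally.
segments, info = model.transcribe(
    "long_recording.mp3",  # placeholder file name
    language="et",
    beam_size=5,
    vad_filter=True,       # optionally skip long stretches of silence
)

with open("long_recording.txt", "w", encoding="utf-8") as out:
    for seg in segments:
        out.write(f"{seg.start:8.2f} {seg.end:8.2f} {seg.text.strip()}\n")
```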
Evaluation results
WER
WER results below are obtained using greedy decoding (i.e., beam size 1).
| Dataset | WER (%) |
|---|---|
| Common Voice 8.0 | 11.3 |
| Common Voice 11.0 | 12.0 |
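For context, a sketch of how a WER figure like the above could be reproduced with greedy decoding (beam size 1, as stated above), assuming the converted model and the jiwer library; the evaluation pairs and setup here are hypothetical, as the model card does not describe them:

```python
from faster_whisper import WhisperModel
from jiwer import wer

model = WhisperModel("whisper-large-et.ct2", device="cuda", compute_type="float16")

# Hypothetical evaluation pairs: (audio path, reference transcript).
eval_set = [
    ("clip_0001.mp3", "see on näidislause"),
    # ...
]

references, hypotheses = [], []
for audio_path, reference in eval_set:
    segments, _ = model.transcribe(audio_path, language="et", beam_size=1)  # greedy
    hypotheses.append(" ".join(seg.text.strip() for seg in segments))
    references.append(reference)

print(f"WER: {wer(references, hypotheses):.3f}")
```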
📄 License
This model is licensed under CC BY 4.0.